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Abstract. The imposition of lockdowns in response to the COVID-19 
outbreak has underscored the importance of human behavior in mitigat- 
ing virus transmission. The scientific study of interventions designed to 
change behavior (e.g., to promote physical distancing) requires measures 
of effectiveness that are fast, that can be assessed through experiments, 
and that can be investigated without actual virus transmission. This 
paper presents a methodological approach designed to deliver such in- 
dicators. We show how behavioral data, obtainable through wearable 
assessment devices or camera footage, can be used to assess the effect 
of interventions in experimental research; in addition, the approach can 
be extended to longitudinal data involving contact tracing apps. Our 
methodology operates by constructing a contact network: a representa- 
tion that encodes which individuals have been in physical proximity long 
enough to transmit the virus. Because behavioral interventions alter the 
contact network, a comparison of contact networks before and after the 
intervention can provide information on the effectiveness of the interven- 
tion. We coin indicators based on this idea Behavioral Contact Network 
(BECON) indicators. We examine the performance of three indicators: 
the Density BECON, the Spectral BECON, and the average shortest 
path length (ASPL) BECON. First, the Density BECON is based on 
differences in network density, i.e., differences in the portion of realized 
edges (connections) relative to all potential edges. Second, the Spectral 
BECON is based on differences in the eigenspectrum of the adjacency 
matrix, which capture the spreading potential of the virus. Third, the 
ASPL BECON is based on differences in the mean of all the shortest dis- 
tances (i.e., number of edges) between each pair of nodes in the network, 
and captures the average distance between nodes. Using simulations, we 
show that all three indicators can effectively track the effect of behav- 
ioral interventions. Even in conditions with significant amounts of noise, 
BECON indicators can reliably identify and order effect sizes of inter- 
ventions. The present paper invites further study of the method as well 
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as practical implementations to test the validity of BECON indicators 
in real data. 


Keywords: Contact networks - Network analysis - Epidemiology - Virus 
spread - Interventions 


1 Introduction 


The COVID-19 outbreak has underscored the importance of human behavior 
in controlling virus transmission. As long as vaccines are not operational, the 
only way to influence transmission rates is through behavioral interventions that 
either prohibit specific kinds of behavior (e.g., attending school, visiting relatives, 
leaving the house) or promote others (e.g., physical distancing, wearing masks, 
complying with regulations). As such, behavior is fundamental to important 
parameters in epidemiological models, such as the reproduction number (the 
number of people a randomly chosen disease carrier is expected to infect): even 
though virus transmission depends on biological characteristics of the virus and 
the human system, its speed reflects an interaction between biology and behavior 
ors) 
Indeed, one way of understanding the reasoning behind lockdowns is that they 
try to drive down the reproduction number by changing behavioral patterns 
[Jeffrey et al. [2020]. 'The goal of this paper is to contribute 
to our understanding of these behavioral patterns, by developing methodological 
tools that can be used to study them. 

'To successfully monitor and control our responses to a virus outbreak like 
COVID-19, we need to obtain insight into the relative effectiveness of different 
behavioral interventions. Relevant behavioral interventions can either be imple- 
mented at a microlevel (e.g., setting up nudges in a store to promote physical 
distancing, changing the floor plan of a restaurant), or a macrolevel (e.g., imple- 
menting public policy measures that promote working from home, closing public 
buildings). Currently, however, methodology for estimating effects of such inter- 
ventions at the behavioral level is limited to highly indirect assessments based 
on measures of virus spread. For example, comprehensive assessments of inter- 
ventions at the macrolevel 2020) have been estimated based on the 
relation between country-level interventions (e.g., school closings, lockdowns) 
and corresponding population statistics (e.g., hospital admissions, IC uptakes, 


death rates; see for example |Flaxman et al.||2020); or they have been treated 


as model parameters to assess the time-course of the epidemic under different 
scenarios — the well-known study by [Ferguson et al.| (2020), which has played an 
important role in COVID-19-related policy, is a case in point. 

There are at least three methodological reasons why indicators such as hos- 
pital admissions are of limited use in assessing effects of behavioral interventions 
designed to counter virus spread. The first problem is that they are lagged in- 
dicators. Evaluating the effect of interventions with hospital admissions as a 
dependent variable suffers from the time course of virus transmission, incuba- 
tion, and disease progression, before one can assess where the intervention has 
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been effective (this delay was two to three weeks for COVID-19). The second 
problem concerns ezperimental inaccessibility. If one studies an intervention that 
is strongly suspected to be effective, it is unethical to install a control group for 
comparison and to wait for participants to become ill. T'he next best alternative 
— a quasi-experimental research setup — suffers from considerable levels of con- 
founding, and because interventions are almost always implemented in packages 
it is hard to disentangle their effects. T'hird, current indicators require an active 
virus. Thus, in a period in which there is no virus active, it is impossible to 
study the effects of behavioral interventions. This is strategically impractical as 
it would be ideal to study behavioral interventions while the virus is inactive 
in order to prepare for a possible future outbreak. Moreover, the COVID-19 
pandemic is unlikely to be the last global pandemic and research in effective 
interventions will remain important, even after the current crisis has ended. 

The scientific study of behavioral interventions thus requires indicators that 
are fast, that can be assessed through experiments, and that can be investigated 
without actual transmission of the virus. In the present paper, we develop a 
methodological approach designed to deliver such indicators. In a nutshell, we 
make use of the fact that behavioral data, obtainable through wearable devices, 
camera data, or tracing apps can be used to assess contact networks 
[2020]. This can either be done at the microlevel (e.g., assessing con- 
tact patterns at a public gathering) or at the macrolevel (e.g., reconstructing 
contact networks at the level of a city on the basis of tracing apps). We coin 
indicators based on such networks Behavioral Contact Network (BECON) in- 
dicators. Because BECON indicators are available in real time, they respond 
to induced changes in contact networks virtually instantaneously; and because 
they do not require actual transmission of the virus, they can be used to as- 
sess effectiveness in healthy subjects, which in turn means they can be studied 
in experiments. As such, BECON indicators are suited to make the connection 
between epidemiology and behavior, and thereby allow behavioral scientists to 
leverage their knowledge and skills in developing optimal interventions to control 
the pandemic. 

'The structure of this paper is as follows. First, we will outline the theoretical 
basis of our approach. Second, we discuss the methodological strategy behind 
BECON indicators in more detail. Third, we present a simulation study that 
serves as a proof of concept. Finally, we discuss future extensions of our work. 


2 Behavioral interventions and the contact network 


To understand the relation between behavior and epidemiology, it is important 
to introduce an essential mediator in this relation: the contact network. A con- 
tact network encodes which people have been sufficiently close to each other to 


transmit the virus (Newman| [2018] Pastor-Satorras, Castellano, Van Mieghem, 
& Vespignani,|2015). In contact networks, individuals (or groups of MEME 


are represented as nodes, similar to the representation used in well-known social 
networks. Two nodes are connected by a link if the corresponding individuals 
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have been in sufficiently close prolonged contact for virus transmission to occur, 
and disconnected otherwise. Exactly what "sufficiently close" means depends 
on the virus in question. For Ebola, which is only transmissible through bodily 
fluids [2014], a link in the contact network would mean that the 
corresponding individuals were in direct physical contact. For SARS-CoV-2, a 
link could be present when two individuals have been within a distance of 1 or 2 
meters of each other for some time, given its airborne transmission (CDC} 2020). 


Virus spread on a contact network can be conceptualized as a process in which 
nodes infect each other via the links in the contact network 
(al. 2015]. Usually, a closed population is divided into epidemiological *compart- 
ments": each individual of the population can be only in one compartment at a 
time and the compartments describe stages of the disease. Typical examples of 
compartments include S (susceptible), E (exposed), I (infectious), R. (removed, 


i.e., either cured or deceased) (Keeling & Rohani| 2011). Mathematically, virus 


spread is a probabilistic process that operates on the contact network topology 
and that specifies infection and curing 
events, i.e., how long a person is infectious and when the person is cured or 
deceased. The time distribution of these events, and the local rules at the host 
(i.e., what happens if a person is in state S, E, I, or R), depend on specifics 
of the virus in question; for instance, for COVID-19, the consensus during the 
SARS-Cov-2 variants operative in 2020 held that people were infectious for an 


average of about 6-7 days (Backer, Klinkenberg, & Wallinga, |2020}. 


Once the structure of the contact network and the compartment model is 
specified, the probability or average fraction of individuals in each compartment 
can be computed per unit time 2013). If 
the contact graph does not change too much over time, other global properties 
of the virus spread can be determined. An important property is the epidemic 
threshold, which is related to the basic reproduction number R 
[2015], and describes the conditions under which outbreaks can occur. If 
the contact network changes over time, then we enter a complicated situation 
in which computer simulations are necessary to study virus spread. Another 
approach is to map the contact graph into a certain class, similar as the classes 
of Erdós-Rényi or Barabasi-Albert random graphs, and properties of such a 
contact graph class can be deduced, in principle, analogously [2018]. 
'Thus, we assume that the time-dependent network has similar properties as the 
properties of the class and thus abstract the temporal changes in time. Finally, 
novel approaches based on the analysis of time series data can be used to include 
the dynamic changes of the network in the analysis 2021). In the 
current paper, we focus on the simplest case, namely one in which the contact 
network is stable over time so that it can be characterized by a single network 
structure. 


Behavior, contact networks, and compartmental epidemiological models are 
strongly related: behavior controls the structure of the contact network, the 
contact network directs the spread of the virus, and the spread of the virus 
determines population statistics of the epidemiological compartments. Figure[1] 
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Figure 1. Different levels of analysis that are used in the scientific study of infec- 
tious diseases like COVID-19. At the microlevel, human behavior determines physical 
distances between individuals that are crucial to virus transmission. The resulting pat- 
tern of physical distances can be aggregated into a contact network at the mesolevel, 
in which nodes represent individuals and edges represent physical contacts that make 
virus transmission possible. These transmission processes determine how many people 
get infected and at what rate; at the macrolevel, mathematical models based on differ- 
ential equations are used to model these quantities. 


represents this hierarchy of levels visually. This has ramifications for how we 
should think about behavioral interventions: interventions (e.g., instructing peo- 
ple to practice physical distancing) lead to behavior change (e.g., people will 
keep more distance), which causes contact networks to change (e.g., the number 
of links in the network may decrease). These changes cascade into population 
level statistics (e.g., the value of R will go down) that eventually determine 
policy success (e.g., number of hospital intakes will stay within limits defined in 
policy considerations). In accordance, a central idea underlying our framework is 
that, in order to connect interventions to epidemiological models, they should be 
represented as operations that transform the contact network 
[2015); the present paper applies this idea to behavioral interventions. 


This analysis opens up an important methodological possibility: if behavioral 
interventions operate through changes in the contact network, then measures of 
that contact network could in principle be used to assess the effect of such 
interventions. Such an approach would address each of the problems highlighted 
in the introduction. First, assessment of the contact network can be executed 
instantaneously, addressing the lagging indicator problem. Second, because the 
contact network depends only on whether individuals are sufficiently close to each 
other, and not on whether they actually infect each other, we can potentially pick 
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up changes in the contact network in the absence of actual virus transmission (of 
course, applications of insights gained would still require knowledge of how active 
transmission works). Third, as a consequence of these two properties, assessments 
of the contact network can be implemented in experimental designs without 
raising ethical concerns of exposing individuals to virus spread. As a result, the 
study of interventions would no longer be limited to assessments of policy effects 
at the societal level but could also 
be used to study manipulations at a much smaller microlevels (e.g., in specific 
locations like public buildings, restaurants, or concerts). Implementation at the 
microlevel in turn facilitates the type of controlled experimental research that 
characterizes psychology (e.g., the implementation of interventions in factorial 
designs). In the next paragraph, we show how contact networks may be assessed 
to construct indicators that allow for such approaches. 


3 BECON indicator methodology: Strategy and rationale 


The idea behind BECON indicators is to assess (functions of) the contact net- 
work on the basis of behavioral observations. In the current paper, we will focus 
on assessments of the contact network using wearables or contact tracing apps 
that are designed to register whether a person has been within a certain distance 
(e.g., 1.5 meters) of another person. This methodology has the advantage that 
it measures a proxy to actual behavior (rather than, e.g., relying on self-reports) 
and that it does not require a controlled environment, so that it can in principle 
be used in daily life, enhancing ecological validity. 

A schematic of the proposed methodology is represented visually in Figure 
The interactions between people in the baseline situation (i.e., the situation 
without the behavioral intervention of interest being implemented) give rise to 
the true baseline contact network (left bottom). 

The true contact network is most likely not directly observable. First, as 
noted, the presence of a link depends on a theory of virus transmission, which 
is approximate. For example, in the case of COVID-19, the presence of aerosol 
transmission or infection via surfaces can create links between people who are at 
a greater distance than 1.5 meters, or between people who were present at the 
same place at distinct time points (e.g., because the aerosols remain present in 
bathrooms after the infectious person has left). Second, if the network integrates 
contacts over time (e.g., by taking the union of all contact networks at each 
time point, which registers during a certain time interval who has ever been in 
contact with who, but not when), the representation will contain false positive 
connections; for instance, when A and B were in contact, and subsequently B and 
C were in contact, then the patterns of links suggests that both A > B — C and 
C — B — A are possible infection routes, while only the former route is possible 
(see also [Dekker et al.] (2021)). Third, various kinds of measurement errors can 
yield false positives and false negatives. In a situation in which tracking devices 
are used, examples of mechanisms that can lead to measurement errors may 
include hardware failures, signal failures, and failures in data processing. 
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Observed baseline contact network Observed experimental contact network 
App data App data 


True baseline contact network True experimental contact network 


Figure 2. Schematic plan of the proposed methodology. Prior to intervention, the con- 
tact network (lower left panel) generates app data that are used to partially reconstruct 
the network structure (left top panel). Under the intervention, the density of the con- 
tact network decreases (lower right panel). This network is also reconstructed on the 
basis of app data (upper panel). The difference between the reconstructed networks is 
used to construct a BECON indicator to assess the effect of the intervention. 


'Thus, the true baseline contact network is generally not observable and diffi- 
cult to assess adequately. In the present paper we therefore represent this network 
as a latent structure. To assess this latent structure, we can use observations, 
as for instance obtained through smartphone apps, video footage analysis, or 
wearable sensors. For instance, a simple starting point may be to have a group 
of people use wearable devices that track their location. Measured location data 
may then be used to decide whether two people have been in contact. Thus, 
the structure of the data required to construct a contact network is of the form 
(AB, BC,...,Y Z] which encodes that persons A and B, B and C, ..., and Y 
and Z have been within 1.5 meters of each other for the amount of time specified 
in the definition of a contact. This is called an edge list in network science, which 
can be transformed into an adjacency matrix. From this matrix, many impor- 
tant metrics in network analysis can be computed 2018). We denote 
the network encoded in the empirically derived adjacency matrix the observed 
baseline contact network (top left in Figure). 

Next, we implement a behavioral intervention. For example, we might in- 
struct people to keep their distance, put up signposts, install an alarm on their 
phones that sounds when they get too close, or use a variety of nudges that 
promote physical distancing. If effective, the intervention changes people's be- 
havior, and as such induces changes in the contact network. We denote the 
resulting network as the true experimental contact network. Like the baseline 
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contact network, the true experimental contact network is not directly observ- 
able, but can be assessed indirectly through location measurements. Such data 
may again be obtained by letting people use wearables in the experimental situ- 
ation, after which the results can be used to arrive at an observed experimental 
contact network. 


Recall that a behavioral intervention effect is, in essence, a transformation of 
the contact network. Hence, if we could directly assess the baseline and exper- 
imental contact networks, we could precisely determine the effect of the inter- 
vention and, given a dynamical regime, we could also assess the degree to which 
the intervention should be expected to mitigate virus spread. Unfortunately, 
however, we cannot directly compare the baseline and experimental networks, 
as these are only indirectly observable. However, we can directly compare the 
observed networks created through measurements. For example, one could com- 
pute the number of links in the observed experimental contact network, and 
compare that quantity to the number of links in the observed baseline network. 
This way, one could assess whether, on average, people keep more distance in 
the experimental condition. From a bird's eye perspective, observational studies 
have shown a substantial decline in average mobility after lockdowns have been 
enforced (e.g., [2020]. However, one could also utilize a variety 
of more advanced network metrics which can provide a detailed picture of the 
changes that the intervention has produced. For example, one could compare the 
networks for their density, diameter, average shortest path lengths, etc. to assess 
the effect of specifically targeted interventions (e.g., interventions that target 
specific individuals central in the contact network, like doctors and teachers). 


'The function that is chosen to assess the difference between networks defines 
a Behavioral Contact Network indicator; a BECON. In this paper, we develop 
three BECONS: The Density BECON, the Spectral BECON, and the ASPL BE- 
CON. To facilitate interpretation, each of the BECON indicators is constructed 
in such a way that a higher value on the indicator implies that the experimental 
network has changed the network in a direction that would in typical circum- 
stances be expected to limit the potential for a virus to spread. The indicators 
studied here are are defined as follows. 


'The Density BECON uses the relative change in network density. Network 
density is defined as the ratio of the number of links to the total number of pos- 
sible links in the network. This measure is epidemiologically relevant, because 
denser networks indicate that more people have been in close proximity to each 
other. The Density BECON is constructed by dividing the density of the ob- 
served baseline contact network by that of the observed experimental contact 
network, where higher values indicate larger experimental effects. In essence, this 
measure simply tracks the extent to which the number of contacts reduces in the 
experimental condition. This measure would be most relevant in microlevel ap- 
plications, where people for instance use wearables during a public event, because 
in this case only the direct contacts are relevant. This is because for COVID-19, 
a person will take several days from infection to being infectious; hence, indirect 
connections that would lead to transfer from one person to another via a third 
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person (A — B — C) are not possible on such short time-scales. This metric 
is particularly suitable in microlevel applications when the period from infec- 
tion to being infectious takes multiple days, as was the case with COVID-19. In 
this situation, indirect connections that would leads to transfer from one person 
to another via a third person (A — B — C) are not possible and, hence, the 
(dynamical) contacts can be concatenated into a single contact network. 


'The Spectral BECON is based on the spectral radius. The spectral radius is 
the largest eigenvalue of the adjacency matrix. The spectral radius is epidemi- 
ologically important, because the inverse of the spectral radius is equal to the 
so-called mean-field epidemic threshold, which is in turn a lower bound to the real 
epidemic threshold (Van Mieghem & Van de Bovenkamp] 
& van de Bovenkamp| 2015). The epidemic threshold plays a central role in net- 
works and copes with structural heterogeneity, because a viral strength above 
the epidemic threshold will endemically infect a non-zero fraction of the nodes 
in the network. In the limiting case of a complete graph on N nodes, where 
all nodes are connected to each other, the epidemic threshold is approximately 
equal to 1/N and the strength of the virus divided by the epidemic threshold is 
approximately equal to the reproduction number, whose critical value is equal to 
one (for a reproduction number larger than one, the model predicts the virus to 
be endemic, while for values below one it predicts that the virus will eventually 
disappear). Moreover, one may control and tune the contact network so that its 
spectral radius is minimized and the vulnerability for infections (virus spread) is 
maximized [2011]. 'The Spectral BECON is constructed by 
dividing the largest eigenvalue of the adjacency matrix of the observed baseline 
contact network by that of the observed experimental contact network, such that 
a larger value indicates a larger intervention effect. The Spectral BECON would 
not be relevant in microlevel research (e.g., tracking people in a location over 
several hours), but rather would apply to measures taken over days, as could be 
gained using contact tracing apps. À specific example would be contact tracing 
within the workspace, where the same group of people met each other over longer 
periods of time. Different set-ups and interventions could be tried out (e.g., dif- 
ferent ‘bubbles’ of employees, different organisation of common spaces, walking 
routes), and their effectiveness could be assessed using the Spectral BECON. 
'The Spectral BECON is thus useful to assess intervention effects that involve 
changes in the network structure that not only affect the number of links per 
node, but work on the architecture of the network as a whole. 


The ASPL BECON uses the average shortest path length (ASPL) between 
pairs of nodes in the network. The shortest path length (SPL) between two nodes 
equals the minimum number of edges that one has to traverse to travel from one 
node to the other; the ASPL is the average value of all SPLs between all pairs of 
network nodes. The ASPL is relevant to virus transmission, because the shorter 
the paths that connect nodes in a contact network are, the easier the virus can 
spread from one person to randomly chosen other person. The ASPL BECON is 
constructed by dividing the ASPL of the observed experimental contact network 
by that of the observed baseline contact network. Like the Spectral BECON, 
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the ASPL BECON could be applied in research where contacts are traced over 
a period of days, in which indirect connections (A — B — C) are relevant. For 
calculation of the ASPL BECON we ignore any infinite shortest path lengths 
that arise from disconnected graphs (i.e., in the case when some nodes or groups 
of nodes are unconnected). Thus if there was more than one single connected 
component (which happened in less than 396 of the cases), paths between nodes 
in different connected components did not exist. We hence ignored these when 
computing the ASPL BECON. 

'The fact that each is expressed as a ratio allows one to interpret the BECON 
values directly: for instance, if the Density BECON equals 2, this means that the 
density of the observed baseline contact network is twice as large as that of the 
observed experimental contact network. Other things being equal, a higher BE- 
CON value would indicate that the intervention would likely be more successful 
in mitigating virus spread (naturally, this should be considered in the light of 
a theory about the virus transmission process). In research at the microlevel, 
where people are traced over periods of hours, the Density BECON would be 
most relevant. 

An important methodological question is whether we can use BECON indi- 
cators to assess the effect of interventions in realistic circumstances, where our 
assessments of the contact network will be distorted in various ways. Thus, the 
question that arises is whether the setup sketched in Figure |2| can be used to 
assess the effect sizes of experimental manipulations in realistic conditions. For 
example, can one order the effects of a set of behavioral interventions in terms of 
effect size? How does the methodology fare in the presence of realistic amounts 
of measurement error? To assess whether this is indeed possible, we now turn to 
a simulation study. 


4 Simulation study 


The simulation study is designed to evaluate whether the BECON methodology 
is indeed able to pick up effects of interventions if these are present. To show 
this, we vary the size of intervention effects on the true contact network, and 
subsequently assess the corresponding BECON values in the observed contact 
network which is subjected to various levels of noise. If the method is reliable, we 
expect the BECON values to be higher if the effects are stronger. We evaluate 
this by computing the correlation between the size of the simulated intervention 
effect and the observed BECON values: the higher the correlation is, the more 
reliable the BECON is as an indicator of intervention effects. 

In the simulation, we vary the number of nodes in the network n € [100, 200, 
500, 1000], specifying the architecture of the contact network as a small world 
structure (Watts & Strogatz| and generate it using the R package igraph 
(Csardi & Nepusz| |2006), with a neighborhood size of K = 5 and a rewiring 
probability of p — 0.10. We use this specification because it leads to a network 
structure with a high degree of clustering yet small average shortest path lengths, 
which is qualitatively similar to the structure of some social networks found in 
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empirical research, where links may e.g., encode whether individuals work at the 
same place or visit the same sports club. This network serves as our baseline 
contact network (Figure [2] left bottom). 


Next, we simulate a measurement process (Figure[2| left arrow). The process 
generates imperfect measures of the baseline contact network. We simulated a 
measurement function with a false negative rate of fn = [0.10, 0.20, 0.30] (e.g., 
fn = 0.30 implies that only 70% of the network links were successfully picked 
up) and false positive rate fp = [0.10, 0.20, 0.30] (e.g., fp = 0.10 implies that 
1096 of the links that are absent in the true contact network will be present in the 
observed network). We emphasize that, in the present paper, our interest is not 
to retrieve the actual network structure by correcting for these distortions or to 
recover the dynamical processes that it supports. Instead, our primary interest 
here lies in the pragmatic goal of assessing the effect of interventions, so as to 
develop an indicator that can serve to bridge the gap between epidemiology and 
behavioral science. 


Subsequently, we assess the observed baseline contact network using the ob- 
tained data (Figure B] top left). This network can be quite severely distorted as 
a result of the probabilistic nature of the measurement function. For instance, 
as is visible in the figure, the observed network is much denser than the true 
contact network, even though the false positive rate is much lower than the 
false negative rate. T'his is because the true contact network is relatively sparse, 
which means it contains more absent than present links: the small world net- 
works we generated for n — 100 individuals have n x (n — 1)/2 — 4,950 possible 
links, of which on average only K x n — 500 are actually present. Hence, a false 
positive rate of 1096 generates about 0.10 x (4, 950 — 500) — 445 false positive 
links, while the false negative rate of 3096 means that on average only 350 of the 
500 actual links are successfully identified. In other words, the observed baseline 
contact network contains about 445 4- 350 — 795 links, of which only 350 are 
true positives. Because links feature such a low base rate, the probability that 
two nodes that are connected in the observed network are actually connected in 
the true contact network is only 350/795, i.e., about 0.44. This effect is stronger 
in larger networks, because in larger networks there are more opportunities for 
false positives; for example, in a network of 1000 individuals, the probability that 
an observed link is actually present in the true network is only 0.07. Thus, the 
actual assessment of the network in itself may be largely unsuccessful, especially 
in larger networks. We think that similar results would be expected in actual em- 
pirical work if the studied contact networks are sparse. However, as we will see, 
the lack of success in assessing the contact network itself does not preclude the 
possibility of assessing intervention effects by comparing the observed networks. 


We now implement an intervention on the network (Figure[2| bottom arrow). 
A typical intervention would be intended to, e.g., improve physical distancing, 
and hence should lead to a lower connectivity in the experimental contact net- 
work (Figure [2] bottom right). We model such an intervention by deleting links 
uniformly at random (a process known as bond percolation in the network lit- 


erature; 2018). The proportion of links deleted from the baseline con- 
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tact network then defines the effect size of the intervention. We vary this effect 
size in steps of 1096, from an ineffective intervention, which deletes no links, to 
the strongest intervention, which deletes 90% of the links. We implement this 
intervention in two ways: the deterministic intervention removes the specified 
proportion of links exactly, by randomly deleting links until this proportion is 
reached, while the probabilistic intervention removes each link with a probability 
equal to the specified proportion. Thus, in an intervention setting with effect size 
of, say, 0.40, the deterministic intervention results in a network where exactly 
4096 of the links are deleted, while the probabilistic intervention removes each 
link with a probability of 0.40, so that the expected percentage of removed links 
equals 4096 while each realization may be different. The probabilistic intervention 
thus implements a situation in which the interventions are represented as a ran- 
dom effect that differs across experimental settings, leading to more uncertainty. 
Importantly, these interventions are but two of the many alternative and possi- 


bly directed interventions that one may study (Trajanovski, Martín-Hernández, 
Winterbach, & Van Mieghem} |2013). 


Finally, we implement the same measurement function as before on the ex- 
perimental contact network. This way, we arrive at the observed experimental 
network (Figure B] top right). As was the case for the baseline contact network, 
this assessment is dominated by false positives, which leads to a significant over- 
estimation of the network density. In addition, the fact that the experimental 
contact network is sparser than the baseline contact network implies that the 
probability of a link being present in the experimental network, given that it is 
present in the observed experimental network, has diminished even more. 


Table 1. Simulation results across conditions for a small world graph. The table re- 
ports correlations (means and sd) between BECONs and intervention effect sizes for 
deterministic and probabilistic interventions across simulation runs for multiple net- 
work sizes. The results are averaged over the false negative rates. 


n FP Deterministic Probabilistic 
Density Spectral ASPL Density Spectral ASPL 
rSD rSD rSDr SDr SDr SD 


100 0.1 1 01 01 0.97 0.03 0.96 0.04 0.97 0.03 
100 0.2 1 01 01 0.92 0.05 0.92 0.06 0.92 0.05 
100 0.3 1 01 01 0.86 0.10 0.86 0.10 0.85 0.11 
200 0.11 01 01 0.97 0.03 0.96 0.03 0.97 0.02 
200 0.2 1 01 01 0.92 0.06 0.91 0.06 0.91 0.06 
200 0.3 1 01 01 0.86 0.10 0.85 0.10 0.85 0.11 
500 0.11 01 01 0.96 0.03 0.96 0.03 0.977 0.02 
500 0.21 01 01 0.92 0.077 0.92 0.06 0.91 0.07 
500 0.3 1 01 01 0.85 0.11 0.84 0.11 0.84 0.11 
1000 0.1 1 01 01 0.96 0.03 0.96 0.03 0.96 0.03 
1000 0.2 1 01 01 0.91 0.06 0.91 0.06 0.91 0.06 
1000 0.3 1 01 01 0.85 0.10 0.85 0.10 0.85 0.10 


cOcocccccoccoccococoocco 


The Lighting of the BECONs 13 


As a measure of the accuracy of the BECONS in ordering the intervention 
effect sizes, we compute the average correlation between the estimated BECONs 
and the actual intervention effect across simulation runs. A correlation of unity 
then means that the BECONS order the interventions perfectly, while a correla- 
tion of 0 means that the BECONs do no better than chance. 


Results of the simulations are given in Table 1. As can be seen from the 
table, despite the fact that the networks used to compute the BECONs were 
poor representations of the actual contact networks, the difference between the 
observed networks tracks the intervention effect size reliably. To illustrate how 
these effects arise, Figure |3| gives a detailed representation of results for one 
specific BECON in the simulation design, i.e., the ASPL BECON performance 
with fp — 0.20 and fn — 0.20. Detailed representations of simulation results 
for all BECONSs across all conditions are given in the Appendix. As shown in 
detail in the Appendix, the results in F igure[3]are representative of the behavior 
of all three BECONs, which uniformly varied as a monotonic function of effect 
size, in the sense that the probability distributions of these statistics stochas- 
tically order the interventions. The separation of effect sizes is in fact perfect 
for all deterministic intervention simulations. In the probabilistic intervention 
simulation, results are somewhat more attenuated, but the correlation between 
true and estimated intervention effects does not drop below 0.80. Thus, even 
with sizeable false positive and false negative rates, the methodology still works 
extremely well. 


As can be seen in Figure [3| and the extended results in the Appendix, BE- 
CONS are monotonically related to intervention effect sizes, such that stronger 
effects result in higher BECONSs. Yet, for larger networks, the increase in BE- 
CON attenuates (i.e., the slope becomes smaller). This can be explained by 
the small world structure of the networks, which makes larger networks rela- 
tively more sparse compared with smaller networks. As a result, a fixed false 
positive rate in the measurement process (e.g., 20% false positives) has a more 
pronounced effect in the larger networks. This is because the number of present 
links grows linearly with the number of nodes, while the number of possible links 
grows quadratically, and therefore larger networks feature a smaller percentage 
of present links. For example, a neighborhood size of 5 results in the presence of 
about 196 of the possible links in a network of 1000 people (5, 000 out of 449, 500), 
but the presence of about 1096 of the possible links in a network of 100 people 
(500 out of 9,900). As this measurement process applies to both the observed 
baseline network and to the observed experimental network, the false positive 
rate will make larger networks relatively more similar to each other than smaller 
networks. Consequently, it is more difficult to detect an effect in larger networks, 
which is reflected in the BECONs. However, as is evident from Table 1, BECON 
performance is robust against this effect across the conditions simulated. 


Finally, we find essentially the same results for a scale-free network, that is, a 
network whose degree distribution follows a power law, and for an Erdós-Rényi 
random graph, which has a binomial degree distribution. Results for these net- 
work structures as well as code to reproduce them are available at 
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Figure3. ASPL BECON indicator values as a function of intervention effect sizes. 
Plots display the range that contains 9596 of the observed BECON values in the relevant 
condition. The true network structure is a small-world network. This figure displays 
results for a setting with false positive and false negative rate of 0.20 


gitlab.com/science-versus-corona/becon, Performance is broadly consis- 


tent across network structures. The reason for this may lie in the noisy mea- 
surement process: the false positive rate results in a substantial amount of new 
links, which obscure the original structure and thereby may counter effects of 
network structure. 


5 Discussion 


In this paper, we have introduced a behavioral data science methodology to 
study interventions designed to counter virus spread, and have shown in simu- 
lations that this methodology is both feasible and effective. BECON indicators, 
constructed to pick up changes in the contact network that are induced by inter- 
ventions, were shown to track intervention effects reliably under realistic mea- 
surement conditions that are characterized by substantial levels of noise. Thus, 
BECON indicators offer a promising methodological approach to studying in- 
terventions designed to mitigate virus spread in COVID-19 and other infectious 
diseases. 
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Because BECON indicators are instantaneous and can be constructed in the 
absence of actual virus spread, they address the main problems that plague 
widely used indicators, such as hospital admissions and death rates. In contrast 
to current indicators, BECON indicators do not suffer from lags and can be 
used in experimental designs; in addition, if used in the absence of actual virus 
transmission, the use of BECON indicators need not put participants in control 
conditions at risk. Therefore, BECON indicators have the important advantage 
that they can be used when viruses are not present, i.e., they allow us to maxi- 
mize our defenses and to prepare for new outbreaks. Of course, the interpretation 
of the results does depend on details of the virus transmission process, and an 
important question is how to relate changes in the observed contact networks to 
this process. 


As indicated, the study of which BECON is best suited for a given research 
question depends on a combination of factors including characteristics of the 
virus, properties of the research design, and the goals of the research program. 
One important issue, highlighted throughout this manuscript, involves the time 
scale at which the research program runs in relation to the time scale at which 
a virus spreads. In cases where people are observed for a shorter time than that 
needed for the virus to incubate and become infectious, the Density BECON 
is always best, because the virus cannot travel more than a single step in the 
network. In cases where people are followed for a period that exceeds this pe- 
riod, the Spectral and ASPL BECONSs become feasible alternatives. If one wants 
to study effects of generic interventions (e.g., lockdowns or school closures) on 
virus spreading potential, then the Spectral BECON is indicated due its close 


relation to the epidemic threshold (Van Mieghem & Van de Bovenkamp} |2013 
Van Mieghem & van de Bovenkamp| 2015). Finally, in cases where one has inter- 


ventions that are specifically targeted at the path lengths in the network (e.g., 
interventions targeted to limit interactions based on a previous interaction his- 
tory), one can use the ASPL BECON. Thus, generally, we would recommend the 
Density BECON in all situations where the research design observes behavior 
at a duration below that needed for the virus to spread, and the Spectral and 
ASPL BECONS in research designs that observe behavior for longer durations; 
the choice between Spectral and ASPL BECONSs then depends on specifics of 
the research question. 


'The Density BECON offers a simple metric to be used in small scale experi- 
ments, where one for instance wants to test the effectiveness of office designs in 
a company, or where one desires to assess the relative effect of different nudges. 
We performed a study in which we applied the BECON methodology in practice. 
During an art fair, we implemented different nudges and evaluated its effect on 
the contact network. This way we could for example show that walking direc- 
tions positively impacted physical distancing and reduced the number of con- 
tacts, demonstrating the effectiveness of the proposed methodology in practice 
2021). Simulations indicate that even for small 
networks the indicators are reliable. However, BECON indicators are potentially 
also applicable to large scale research. For example, using contact tracing apps, 
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it should be possible to assess contact networks at the scale of neighborhoods, 
cities, or even countries. Thus, BECON indicators could be implemented in a 
dashboard used by policy makers to assess the degree to which current policies 
are on track. Because they are much faster than traditional indicators, they may 
also be highly useful in contributing to alarm systems that indicate that policy 
action is required. 

While our approach aims to change the network structure by deleting links, 
a closely related notion is the removal of nodes (known as site percolation; 
[2018]. 'This is especially important in vaccination campaigns, especially if 
vaccinating an individual leads to a situation where the vaccinated individual 
cannot receive nor spread the disease Strictly speaking, 
vaccinating an individual does not change the network structure. For the pur- 
poses of disease spreading, however, one may reformulate it as such: in cases 
where vaccinating an individual ensure that the individual will no longer be able 
to spread the disease, this implies that all links going into and out of the node 
will be removed. Although it focuses on node rather than link removal, research 
done on animal populations is related to the approach we propose here. 
Semple, Morrogh-Bernard, Zuberbuchler, and Lehmann] (2013), for example, usc 
simulations to study how targeted vaccination (e.g., vaccinate the most central 
nodes) outperform random vaccination in a network derived from observations 
of orangutans and chimpanzee populations, respectively. They used cluster size 
and shortest paths of the network as measures to assess the effect of interventions 
(see also E RR RITE Rot d the network structure to 
the final outbreak size, (2014) find in simulations that vaccinat- 
ing the most connected chimpanzees can reduce the outbreak size considerably. 
While these articles focus on node removal, we expect that our approach can 
learn from such approaches; future research may explore this line more fully. 

Several limitations to the present work should be noted. For example, as we 
have emphasized throughout this paper, observed networks will ordinarily be 
poor representations of the contact networks of interest unless contact network 
assessments are augmented by more advanced measurement methods. Therefore, 
one should be very careful in using observed contact networks as proxies for the 
underlying contact networks. For example, the fact that observed networks are 
likely to be much more dense than the underlying contact networks suggests 
that it would not be a good idea to use these observed networks naively in, e.g., 
simulations of virus transmission. However, our primary interest in this paper 
was not in the reconstruction of contact networks per se, but in the comparison of 
contact networks across experimental conditions. Standard experimental wisdom 
holds that errors in observations need not form an insurmountable problem as 
long as the structure and size of the induced distortions is comparable across 
experimental conditions; in this case, systematic errors will be invariant across 
conditions, and random errors will average out as the number of observations 


3 Here it is important to note that for COVID-19, vaccination does not preclude an 
individual from receiving or spreading the virus, although transmission rates among 


vaccinated individuals were reduced. (Eyre et al. 2022) 
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grows. This indeed is shown to be the case in our simulations. It should be 
noted that more advanced reconstruction methods may lead to biases if they 
are differentially effective across conditions, so a better reconstruction need not 
imply a better signal of intervention effectiveness. 


Our approach invites improvements at several points. First, we study a sim- 
ple measurement process. In the real world, it is likely that the observed network 
is not biased in a manner as we study here (i.e., by adding false positives and 
false negatives), but that more complicated biases occur as would, for example, 
arise when the structure of missing data depends on the network structure itself. 
Such biases can be addressed by adding an intermediary network reconstruction 
step. In particular, before computing the BECONS, one can use network recon- 
struction algorithms to first arrive at a better representation of the true network 


., |Clauset, Moore, & Newman| ; (Ghasemian, Hosseinmardi, Galstyan, 
Airoldi, & Clauset||2020 :|Guimera & Sales-Pardo 


Similarly, the use of additional sources of information deliverable through mobile 
phones (e.g., geographical location data, WIFI data, ultra wide-band technology) 
could enhance the precision of the signal 
2016). In addition, if the amount of bias in the measurement function is not 
equal across experimental conditions, the presented methodology would likely 
lead to incorrect conclusions. Because of the sparsity of the contact networks, 
even differences in random noise could potentially lead to bias in the effect sizes 
under certain conditions. For example, if the experimental intervention increases 
the percentage of false negatives, it can seem effective while it is not. Statisti- 
cal corrections could be developed on the basis of latent variable models, which 
are able to accommodate violations of measurement invariance to some extent 


11993; |Van De Schoot et al. 2013]. 


Another open question is whether the validity of BECONs is symmetric; can 
we pick up interventions that make the network more densely connected as easily 
as interventions that prune it? This is an important question in the process of 
monitoring lifting regulations, which is expected to create increasingly connected 
contact networks. BECON methodology could be used to assess the relative risk 
of different lifting interventions experimentally, and as such may inform exit 
strategies. 


Given that the value of the BECON methodology especially lies in its ability 
to tap into actual (distancing) behavior, we hope the present paper contributes 
to experimental research into the effectiveness of behavioral interventions in this 
domain. However, the interventions we have studied in this paper are very sim- 
ple, as they delete links at random. One may interpret such an intervention as 
reducing contacts at random. It would be interesting to investigate what hap- 
pens if an intervention does not randomly delete links, but affects the structure 
of the network in a different way (as for example in[Trajanovski et al.| 2013). For 
example, one could examine what happens if interventions selectively take out 
shortest paths, but keep clusters (e.g., groups of friends) intact. This would prob- 
ably change not only the density, but also the structure of the contact network 
after intervention; for example, selectively targeting nodes with high centrality 
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may lead the network to lose its small world character. These interventions would 
have considerable effect on epidemic spread. One such investigation was recently 
provided by [Block et al.] (2020). 'The authors study three types of interventions 
— limiting social interactions to a few individuals, seeking similarity across con- 
tacts, and strengthening communities — that change the contact network. They 
subsequently simulate virus spread and find that all three interventions sub- 
stantially flatten the infection curve compared to no intervention, as well as to 
an intervention that makes actors randomly reduce their contacts (i.e., removes 
links at random). As suggested above, this shows that more thoughtful inter- 
ventions can have a more drastic effect on virus transmission. Since different 
interventions change the contact network in different ways, it is important to 
choose the right BECON. In addition, such thought experiments suggest that 
the question under which conditions which BECONs can adequately track virus 
transmission is open for future research; one could imagine, for instance, imple- 
menting different interventions and running epidemiological virus spread models 
on the resulting networks to understand how different interventions change the 
epidemiological course of the virus. Finally, the current setup ignores dynami- 
cal information about interventions (e.g., the duration of effects), and extending 
measurements and models in this direction could augment the signal consider- 


ably (Dekker et al.||2021). Such information could then be used to assess these 


interventions in a more precise fashion. 


Not all interventions are amenable to the BECON approach. For instance, 
certain interventions may be highly effective in controlling the virus spread with- 
out implementing large changes in the network. A salient example arises when 
interventions are directed at interrupting processes that run on specific parts 
of the contact network, rather than at changing the network structure globally. 
Such interruptions are, for instance, the goal of contact tracing (Cencetti et al.| 
Kojaku, Hébert-Dufresne, Mones, Lehmann, & Ahn] 
.|[2020). Interventions based on contact tracing provide highly local and surgi- 
cal interventions on the network (namely, by isolating potentially infected cases, 
such procedures delete the corresponding links from the contact network). It is 
unlikely that global measures like BECONs would be able to pick up such subtle 
effects, especially if cases are rare so that relatively few links are deleted. Also, 
it would be important to study whether contact tracing interventions invariably 
lead to structures that diminish the potential for virus transmission; one can 
imagine situations where alarm systems based on contact tracing may instigate 
behavior that increases virus transmission. This would especially be relevant if 
alarm systems are unreliable or are activated too late, which underscores the 
importance of accurate prediction. 


In the future, network theory may assist behavioral scientists in developing 
novel interventions. For example, if advanced reconstruction of the contact net- 
work to a high level of precision becomes feasible, interventions could be shaped 
by the analysis of the baseline network itself. In such an approach, one could 
first analyze the baseline contact network, and then explicitly design interven- 
tions to target particular aspects of the contact network to induce maximal 
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change, analogous to the use of targeted vaccinations in epidemiology 
002). Similar 
interventions could be explicitly targeted to decrease the spectral radius of the 
contact network (Van Mieghem et al.||2011), because this controls the potential 
for outbreaks (Van Mieghem & Van de Bovenkamp) 2013] Van Mieghem & van de 
Bovenkamp| 2015). In accordance, targeted evaluations of changes in the contact 
network after intervention could be used to assess whether the intended changes 
have indeed been accomplished. This approach potentially defines an extensive 
research program, in which behavioral data scientists, epidemiologists, psychol- 
ogists, computer scientists, and statisticians could profitably work together to 
construct, implement, and monitor optimal interventions. 
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Appendix 


A Simulation results 


Additional tables and figures reporting the simulation results across all condi- 


tions. Complete results are available at https://gitlab.com/science-versus 


A.1 Small world graph 
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Figure 4. ASPL BECON indicator values as a function of the network size, the inter- 
vention effect sizes, the false positive rate, and the false negative rate. The true network 
structure is a small-world network. A fixed number of false positives and negatives were 
added in each run to create the observed network. 
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Figure 5. ASPL BECON indicator values as a function of the network size, the inter- 
vention effect sizes, the false positive rate, and the false negative rate. The true network 
structure is a small-world network. In the observed network, each absent edge in the 
true network had a probability, FP, of becoming a false positive of and each present 
edge a probability, FN, of becoming a false negative. 
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Figure 6. Density BECON indicator values as a function of the network size, the 
intervention effect sizes, the false positive rate, and the false negative rate. The true 
network structure is a small-world network. A fixed number of false positives and 
negatives were added in each run to create the observed network. 
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Figure 7. Density BECON indicator values as a function of the network size, the 
intervention effect sizes, the false positive rate, and the false negative rate. The true 
network structure is a small-world network. In the observed network, each absent edge 
in the true network had a probability, FP, of becoming a false positive of and each 


present edge a probability, FN, of becoming a false negative. 
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Figure8. Spectral BECON indicator values as a function of the network size, the 
intervention effect sizes, the false positive rate, and the false negative rate. The true 
network structure is a small-world network. A fixed number of false positives and 


negatives were added in each run to create the observed network. 
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Figure9. Spectral BECON indicator values as a function of the network size, the 
intervention effect sizes, the false positive rate, and the false negative rate. The true 
network structure is a small-world network. In the observed network, each absent edge 
in the true network had a probability, FP, of becoming a false positive of and each 
present edge a probability, FN, of becoming a false negative. 
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A.2 Scale-free graph 


Table 2. Simulation results across conditions for a scale free graph. The table reports 
correlations (means and sd) between BECONs and intervention effect sizes for deter- 
ministic and probabilistic interventions across simulation runs for multiple network 
sizes. T'he results are averaged over the false negative rates. 


n FP Deterministic Probabilistic 
Density Spectral ASPL Density Spectral ASPL 
rSD rSD rSD r SD r SD r SD 


100 0.110 10 10 0.97 0.02 0.98 0.02 0.97 0.02 
100 0.210 10 10 0.92 0.06 0.93 0.06 0.92 0.07 
100 0.310 10 10 0.86 0.1 0.87 0.1 0.85 0.1 
200 0.110 10 10 0.97 0.03 0.98 0.02 0.97 0.03 
200 0.210 10 10 0.92 0.06 0.93 0.06 0.91 0.06 
200 0.310 10 10 0.85 0.1 0.86 0.1 0.84 0.11 
500 0.110 10 10 0.97 0.03 0.97 0.02 0.97 0.03 
500 0.210 10 10 0.92 0.06 0.92 0.05 0.92 0.06 
500 0.310 10 10 0.86 0.1 0.86 0.09 0.86 0.1 
1000 0.110 10 10 0.97 0.03 0.97 0.02 0.97 0.03 
1000 0.210 10 10 0.92 0.06 0.92 0.05 0.92 0.06 


1000 0.310 10 10 0.86 0.1 0.86 0.1 0.86 0.1 
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Figure 10. ASPL BECON indicator values as a function of the network size, the 
intervention effect sizes, the false positive rate, and the false negative rate. The true 
network structure is a scale free network. A fixed number of false positives and negatives 
were added in each run to create the observed network 
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Figure11. ASPL BECON indicator values as a function of the network size, the 
intervention effect sizes, the false positive rate, and the false negative rate. The true 
network structure is a scale free network. In the observed network, each absent edge 
in the true network had a probability, FP, of becoming a false positive of and each 
present edge a probability, FN, of becoming a false negative. 
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Figure 12. Density BECON indicator values as a function of the network size, the 
intervention effect sizes, the false positive rate, and the false negative rate. The true 
network structure is a scale free network. A fixed number of false positives and negatives 
were added in each run to create the observed network. 
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Figure 13. Density BECON indicator values as a function of the network size, the 
intervention effect sizes, the false positive rate, and the false negative rate. The true 
network structure is a scale free network. In the observed network, each absent edge 
in the true network had a probability, FP, of becoming a false positive of and each 
present edge a probability, FN, of becoming a false negative. 
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Figure 14. Spectral BECON indicator values as a function of the network size, the 
intervention effect sizes, the false positive rate, and the false negative rate. The true 
network structure is a scale free network. A fixed number of false positives and negatives 
were added in each run to create the observed network. 
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Figure 15. Spectral BECON indicator values as a function of the network size, the 
intervention effect sizes, the false positive rate, and the false negative rate. The true 
network structure is a scale free network. In the observed network, each absent edge 
in the true network had a probability, FP, of becoming a false positive of and each 
present edge a probability, FN, of becoming a false negative. 
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A.3  Erdós-Rényi random graph 


'Table 3. Simulation results across conditions for a Erdós-Rényi graph. The table 
reports correlations (means and sd) between BECONs and intervention effect sizes 
for deterministic and probabilistic interventions across simulation runs for multiple 
network sizes. The results are averaged over the false negative rates. 


n FP Deterministic Probabilistic 
Density Spectral ASPL Density Spectral ASPL 
rSD rSD rSD r SD r SD r SD 


100 0.110 10 10 0.97 0.02 0.97 0.02 0.97 0.02 
100 0.210 10 10 0.92 0.06 0.92 0.06 0.92 0.06 
100 0.310 10 10 0.85 0.11 0.85 0.11 0.84 0.11 
200 0.110 10 10 0.97 0.02 0.97 0.03 0.97 0.02 
200 0.210 10 10 0.92 0.05 0.92 0.05 0.92 0.06 
200 0.310 10 10 0.85 0.09 0.85 0.1 0.85 0.09 
500 0.110 10 10 0.97 0.03 0.97 0.03 0.97 0.03 
500 0.210 10 10 0.92 0.06 0.92 0.06 0.92 0.06 
500 0.310 10 10 0.85 0.1 0.85 0.1 0.85 0.1 
1000 0.110 10 10 0.97 0.03 0.97 0.03 0.96 0.03 
1000 0.210 10 10 0.91 0.06 0.91 0.06 0.91 0.06 


1000 0.310 10 10 0.85 0.11 0.85 0.11 0.85 0.11 
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Figure 16. ASPL BECON indicator values as a function of the network size, the 
intervention effect sizes, the false positive rate, and the false negative rate. The true 
network structure is an Erdós-Rényi random network. A fixed number of false positives 
and negatives were added in each run to create the observed network 
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Figure 17. ASPL BECON indicator values as a function of the network size, the 
intervention effect sizes, the false positive rate, and the false negative rate. The true 
network structure is an Erdós-Rényi random network. In the observed network, each 
absent edge in the true network had a probability, FP, of becoming a false positive of 
and each present edge a probability, FN, of becoming a false negative. 
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Figure 18. Density BECON indicator values as a function of the network size, the 
intervention effect sizes, the false positive rate, and the false negative rate. The true 
network structure is an Erdós-Rényi random network. A fixed number of false positives 
and negatives were added in each run to create the observed network. 
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Figure 19. Density BECON indicator values as a function of the network size, the 
intervention effect sizes, the false positive rate, and the false negative rate. The true 
network structure is an Erdós-Rényi random network. In the observed network, each 
absent edge in the true network had a probability, FP, of becoming a false positive of 
and each present edge a probability, FN, of becoming a false negative. 
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Figure 20. Spectral BECON indicator values as a function of the network size, the 
intervention effect sizes, the false positive rate, and the false negative rate. The true 
network structure is an Erdós-Rényi random network. A fixed number of false positives 
and negatives were added in each run to create the observed network. 
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Figure 21. Spectral BECON indicator values as a function of the network size, the 
intervention effect sizes, the false positive rate, and the false negative rate. The true 
network structure is an Erdós-Rényi random network. In the observed network, each 
absent edge in the true network had a probability, FP, of becoming a false positive of 
and each present edge a probability, FN, of becoming a false negative. 
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Abstract. Bayesian approach is becoming increasingly important as 
it provides many advantages in dealing with complex data. However, 
there is no well-defined model selection criterion or index in a Bayesian 
context. To address the challenges, new indices are needed. The goal of 
this study is to propose new model selection indices and to investigate 
their performances in the framework of latent growth mixture models 
with missing data and outliers in a Bayesian context. We consider 
latent growth models because they are very flexible in modeling complex 
data and becoming increasingly popular in statistical, psychological, 
behavioral, and educational areas. Specifically, this study conducted five 
simulation studies to cover different cases, including latent growth curve 
models with missing data, latent growth curve models with missing data 
and outliers, growth mixture models with missing data and outliers, 
extended growth mixture models with missing data and outliers, and 
latent growth models with different classes. Simulation results show that 
almost all proposed indices can effectively identify the true model. This 
study also illustrated the application of these model selection indices in 
real data analysis. 


Keywords: Model selection criterion - Bayesian estimation - Latent 
growth models - Missing data - Robust method 


1 Introduction 


Bayesian approach is becoming increasingly important in estimating models as 


it provides many advantages in dealing with complex data (e.g., [Dunson] 2000). 


However, there is no well-defined model selection criterion or index in a Bayesian 


context (e.g.,|Celeux, Forbes, Robert, & Titterington||2006). It is due to at least 


three problems. First, in a Bayesian context there are two versions of deviance 
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because the Bayesian procedure generates Monte Carlo Markov chains for each 
parameter. One version is the posterior estimate, which can be estimated by 
a function of an estimate of a parameter. Another version is the Monte Carlo 
estimate of the expected deviance based on Bayesian iterations, which can be 
estimated as the posterior mean of a converged Markov chain. In short, the 
former is the deviance of the averaged estimates, and the latter is the average 
of all deviance iterations. The second problem is related to the complexity 
of the raw data. The data often come from heterogeneous populations which 
almost unavoidable contain outliers and missing values. The estimates from 
mis-specified models may result in severely misleading conclusions. The third 
problem relates to the likelihood function. When latent variables are considered 
in statistical models, the likelihood function can be an observed-data likelihood 
function, a complete-data likelihood function, or a conditional likelihood function 
(Celeux et al.|[2006). Furthermore, if data come from heterogeneous populations, 
the class membership indicator may have different versions, for example, a 
posterior mode or a posterior mean. Also, with missing data, the likelihood 
functions have different ways to construct. 


1.1 Model Selection Criteria/Indices 


Traditional model selection criteria or indices are available for researchers who 
try to select the best-fit model from a large set of candidate models. 
proposed the Akaike's information criterion (AIC), which offers a relative 
measure of the information lost. For Bayesian models the Bayes factor, which is 
the ratio of posterior odds to prior odds, can work for both hypothesis testing 
and model comparison. But the Bayes factor is often difficult or impossible 
to calculate, especially for models that involve random effects, large numbers 
of unknowns or improper priors. To approximate the Bayes factor, 
(1978) developed the Bayesian information criterion (BIC, sometimes called the 
Schwarz criterion). To obtain more precise indices, |Bozdogan| (1987) proposed 
the consistent Akaike information criterion (CAIC) [xum 7) proposed 
the sample-size adjusted ELS das criterion (ssBIC). Th dr RA 
information criterion (DIC; |S is a 
recently developed criterion designed for hierarchical models. It c 2 a on 
the posterior distribution of the log-likelihood and is useful in Bayesian model 
selection problems where the posterior distributions have been obtained by 
Markov chain Monte Carlo (MCMC) simulation. DIC is usually regarded as 
a generalization of AIC and BIC. It is defined analogously to AIC or BIC 
with a penalty term of the number equal to effective model parameters in 
Bayesian models. In practice, rough DIC (RDIC or DICV in some literature, 
e g., Oldmeadow & Keith} [2011) is an approximation of DIC. The mathematical 
forms of AIC, BIC, CAIC, ssBIC, and DIC are closely related to each other. 
They all try to find a balance between the accuracy and the complexity of the 
fitting model. For all indices above, the model with a smaller criterion/index 
value is better supported by data. 
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Lu, Zhang, and Cohen| (2013) proposed a series of Bayesian model selection 
indices based on the traditional ones. However, in (2013) the 


performances of these indices were investigated when data were non-mixture, 
normally distributed, and with simple non-ignorable missingness. And only 
latent growth models were used. 


1.2 Goals and Structure 


To address the challenges in model selection criterion/index in a Bayesian 
context, this paper proposes ten model selection indices. This paper also 
examines the performance of these indices under various conditions by 
conducting five simulation studies to cover different latent growth models, 
such as the robust growth models for non-normally distributed data, robust 
growth mixture models, and the extended robust growth mixture models with 
missing values. We consider latent growth models because they are very flexible 
in modeling complex data and becoming increasingly popular in statistical, 
psychological, behavioral, and educational areas. 

The rest of the article consists of five sections. Section 2 presents and 
formulates three types of models we used in this paper: latent growth models 
(including growth curve models, growth mixture models, and extended growth 
mixture models), robust growth models (including three types of robust 
models), and models that account for missingness (we focus on non-ignorable 
missingness). Section 3 proposes ten model selection indices in the framework of 
Bayesian growth models with missing data. Section 4 conducts five simulation 
studies to evaluate the performance of the Bayesian indices. Model selection 
results are analyzed, summarized, and compared. Section 5 illustrates the 
application of these model selection indices in real data analysis. Section 6 
discusses the implications and future directions of this study. 


2 Latent Growth Models, Robust Growth Models, and 
Missing Values 


Our investigation of the performance of the Bayesian selection indices involves 
fitting growth models to complex data. In this section, different types of growth 
models are briefly introduced. Given the fact that the data used in growth models 
are almost inevitably contain attrition (e.g.,|Little & Rubin| 
Lubke| and outliers (e.g.,|Maronna, Martin, & Yohai 
2006), different types of growth models are developed, which include traditional 
latent growth curve models with missing data 2013), robust growth 
curve models (Zhang, Lai, Lu, & Tong} |2013) with missing data 
[2021], growth mixture models (e.g., Bartholomew & Knott} with missing 


data (Lu & Zhang| |2014), extended growth mixture models (EGMMs, [Muthén] 
& Shedden||1999) with missing data (Lu & Zhang} 2014), and robust growth 
Lu & Zhang} 2014). 


mixture models with missing data ( 
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In the following, we discuss three types of models: latent growth models 
(including growth curve models, growth mixture models, and extended growth 
mixture models), robust growth models (including three types of robust 
models), and models that account for missingness (we focus on non-ignorable 
missingness). By combining different elements of these models, it becomes 
possible to consider a series of growth models with a variety of missing data 
mechanisms and contaminated data. 


2.1 Latent Growth Models 


The mathematical form of a latent growth curve model is 


yi = An, + ei 
TE ? Q) 


where y; is a T x 1 vector of outcomes for participant i(i = 1,...,N), n; is a 
q X 1 vector of latent effects, A is a T x q matrix of factor loadings for 7,;, e; is a 
T x 1 vector of residual or measurement errors, (9 is a qx 1 vector of fixed-effects, 
and £; captures the variation of 7;. We have to note that e; and £; are usually 
assumed normally distributed but not necessary. When data have outliers and 
are heavy-tailed, this assumption might cause estimate biases. To reduce the 
effects of outliers, we consider robust models in this study. 
A growth mixture model can be expressed as 


K 
flys) = 3 n fati), (2) 
k=1 


where 7; is the invariant class probability (or weight) for class k satisfying 
deis cpu MEDI ee iz ERE i 
1,..., K) is the density of a latent growth model for class k. 

or extended growth mixture models (EGMMs, [1999], 
Tp is not invariant across individuals. It is allowed to vary individually depending 
on covariates, so it is expressed as 7;x(x;). If a probit link function is used, then 


1 (xi) = 9(X; v) 
Tak (Xi) = D(X} Pr) ms D(X; x1); (k = 2,3,..,K — 1), (3) 
Tik(xi) = 1— P(X; ek. i) 


where &(-) is the cumulative distribution function (CDF) of the standard normal 
distribution, and X; = (1,x;)/ with an r x 1 vector of observed covariates x;. 
Note that P(X! pr) = 355, rij (xi) and P(X} y) = 1. 

A dummy variable z; = (21, 2i2,...,2iK)’ is used to indicate the class 
membership. If individual i comes from group k, zip = 1 and zi; = 0 (Vj z k). 


zi is multinomially distributed (McLachlan & Peel, 2000, p.7), that is, z; ~ 


MultiNomial (m1, Tiz, ..., Tix). 


Bayesian Model Selection Indices 39 


2.2 Robust Growth Models 


When data have outliers and are heavy-tailed, robust methods are used to 
reduce the effects of outliers. As t-distributions are more robust than normal 


distributions, the following are robust growth models (Lu & Zhang, 2021, 
2013). 


(1) t-Normal (TN) model in which the measurement errors are t-distributed 
and the latent random effects are normally distributed, 


ej ~ Mtr(0,0,v) (4) 
€; ~ MN, (0,¥) : 


where Mtr(0,0,v) is a T-dimensional multivariate t-distribution with a scale 
matrix © and degrees of freedom v, and M N,(0,V) is a q-dimensional 
multivariate Normal distribution with a covariance matrix V. 
(2) Normal-t (NT) model in which the measurement errors are normally 
distributed but the latent random effects are t-distributed, 
ej v M Nr(0, O) (5) 
E; ~ Mt, (0, P, u) ` 
(3) t-t (TT) model in which both the measurement errors and the latent 
random effects are t-distributed, 


e; ~ Mtr(0,0,v) 6 
£, ~ Mt,(0,W,u) ` (6) 


2.3 Missing Values 


We focus on the non-ignorable missingness in this paper. To build models with 
non-ignorable missingness, selection models 
are used. For individual i, let m; = (mi1, mij, ..., Mir)’ be a missing 
data indicator for y;, with m;, = 1 when yj; is missing and 0 when observed. Let 
Tit = p(mj, = 1) be the probability that y;; is missing. Then mj, ~ Bernoulli(7;), 
so its density function is f(m;) = TR#(1L — r4)0-":). The missingness 
probability Ti can have different forms. proposed the 
following non-ignorable missingness mechanisms for mixture models. 

(1) Latent-Class-Intercept-Dependent (LCID) missingness in which rj, is 
a function of latent class, covariates, and latent individual initial levels. For 
example, students are more likely to miss a test if their starting levels of that 
course are low. We model it as follows: 


Tit = O (zi, + Lyn + X; Yat) (7) 


where J; is the latent initial levels for individual i, yz; is the coefficient for 
li, Ya is the coefficient for class membership, and *,, are coefficients for 
covariates. For non-mixture homogenous growth models, LCID can be simplified 
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to Latent-Intercept-Dependent (LID) without the class membership indicator z; 
and expressed as 7j; = d (oo, + Livre + X; Yri), where Yor is the intercept. 

(2) Latent-Class-Slope-Dependent (LCSD) missingness in which T; is a 
function of latent class, covariates, and latent individual slopes of growth. For 
example, students are more likely to miss a test if they have slow growth of the 
course. In this case, 7;; can be modeled as 


Tit — D2 ot + Si^yst F Vat) (8) 


where S; is the latent slope for individual i, and ys; is the coefficient for S;. 
Similarly, for non-mixture homogeneous growth models, LCSD is simplified to 
Latent-Slope-Dependent (LSD) case as Tit = ®(Yor + Sivse + XV ut): 

(3) Latent-Class-Outcome-Dependent (LCOD) missingness in which 7;; is a 
function of latent class, covariates, and potential outcomes that may be missing. 
For example, a student who feels he is not doing well on the test may be more 
likely to give up taking the rest of the test. We express Tj, as 


Tit = (ZY + Yit Wut + Xia). (9) 
where yi; is the potential outcomes for individual 7 at time t, and ^y; is the 
coefficient for yi. And LCOD can be simplified to Latent-Outcome-Dependent 
(LOD) for non-mixture homogeneous growth models with a probability of 
missingness Ti, = P(Yor + Yit Yyt + Xii) 

In a more general framework, LCID and LCSD can be further encompassed 
into Latent-Class-Random Effect-Dependent missingness as intercept and 
slope are different random effects according to different situations under 
consideration. And for non-mixture structure, LID and LSD are encompassed 
into Latent-Random Effect-Dependent missingness. 


3 Bayesian Model Selection Indices 


In this section, we propose ten model selection criteria in the framework of 
Bayesian growth models with missing data. The definitions of the selection 
criteria are listed in Table The model selection criteria in the table are 
based on two versions of deviance in the Bayesian context, Epj,[D(0)| and 
D(Eg,[0]). As we have discussed in the introduction section, Egj,,[D] is the 
expected value of all the deviances, and D(Es,[0]) is the deviance score based on 
the expected parameters. For different models, the detailed mathematical forms 
of these two deviances are different. In this paper, we focus on both homogeneous 
and heterogeneous latent growth models with non-ignorable missing data. 

We first look at the homogeneous growth curve models with non-ignorable 
missing data. One version of deviance, Ep, |D(0)], is approximated by 


Epy[D())] ~ DO) =~ 53 YU lym) 
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Table 1. Model Selection Indices 


Index — Deviance 4- Penalty 
Dbar.AIC! D(6). 2p 

Dbar.BIC? D(0) log(N) p 
Dbar.CAIC D(0) (log(N)--1) p 
Dbar.ssBIC D(0) log((N+2) /24) p 
RDIC D(0) var(Dbar)/2 
Dhat.AIC D(6)° 2p 

Dhat.BIC | D(0) log(N) p 
Dhat.CAIC D(6) (log(N)+1) p 
Dhat.ssBIC D() log((N+2) /24) p 
DIC? D(6) 2pD 


Note. 
1. p is the number of parameters. 
2. N is the sample size. 
3. pD = D(0) — D(6). 
4. D(0) is shown as in eqn.(10) for growth curve models and as in eqn. (13) for 
growth mixture models. 


5. D(0) is shown as in eqn. (12) for growth curve models and as in eqn. (14) for 
growth mixture models. 


where S is the number of iterations for converged Markov chains, 1 (0 y, m) = 
log( L (Oly, m)) is a conditional joint loglikelihood function (see, Celeux et al.| 
of y and m, mj, is the missing data indicator for individual ? at time t 
with a likelihood function l;j;(m) = Milog(Tit) + (1 — mj )log(1 — Tit), where 
Ti; is the missing data rate for individual i at time t and is defined differently 
for different missingness models as in the previous section. When yj; is missing, 
the corresponding likelihood is excluded. So combining y and m, the conditional 
likelihood function of a selection model with non-ignorable missing data can be 
expressed as 


La(0|y, m) = [f(yln) n 9 zm, (11) 


And the other version of deviance, D(Esj,[0]), is approximated by 


N T 
D(Eoy[0]) = DÔ) = -25 Y |a = mijtatul) + að), (2) 


i=l t—1 


where Ó is the posterior mean of parameter estimates across S iterations. 
For growth mixture models with missing data, Eg, [D] is expressed as 


EpylD(9)) DO = 5379 Ae D [0 moli) + Hs (m)] (13) 


wn 
Il 
un 
2 
ll 
un 
> 
ll 
un 
+ 
ll 
m 
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where z; = (2i1,Zi2,..., Zik) is the class membership indicator which follows a 
multinomial distribution, z; ~ MultiNomial(n;i,72,...,T;k), and 2D is the 
class membership estimated at iteration s. And 


T 
D(Eoy[8]) = D(B) = -29 2 2a Y; [0 — midtas(vlÓ) + linen] . (14) 


where 2j; is the posterior mode of class membership, Ê is the posterior mean of 
parameter estimates across all S iterations. In both D(0) and D(@) definitions 
of deviance, lixt(y) and lixe(m) are the conditional loglikelihood functions for yit 
and mij, respectively, for individual 7 in class k at time t. 


The difference between D(0) and D(0) can be quantified by a statistic called 


pD (Spiegelhalter et al.{ |2002), 


pD =D) - DÔ). (15) 


Based on the Jensen’s inequality (Casella & George||1992), when D(0) is convex, 


then D(0) > D(0) and as a result pD is positive. When D(@) is concave, then 


D(0) € D(0) and pD is negative. 


4 Simulation Studies 


In this section, five simulation studies are conducted to evaluate the performance 
of the Bayesian indices. For each study, four waves of complete data are 
generated first and then missing data are created on each occasion according 
to pre-designed missing data rates. After data are generated, full Bayesian 
methods are used by adopting uninformative priors, obtaining conditional 
posterior distributions through application of a data augmentation algorithm, 
generating Markov chains through a Gibbs sampling procedure, conducting 
convergence testing, and making statistical inference for model parameters. For 
all simulations, the software OpenBUGS is used for the implementation of Gibbs 
sampling, and R is used for data-generation, convergence testing, and parameter 
estimation. 

'The five studies are designed such that the data complexity increases from 
study 1 to study 5. Studies 1-2 focus on non-mixture growth data and thus, 
latent growth curve models with missing data are used. Studies 3-5 focus on 
mixture growth data and thus, growth mixture models with missing data are 
used. Simulation factors include measurement error distributions, random-effects 
distributions, missingness patterns, sample size, and class separation 


& Bahadur||1962). Under each condition, 100 converged replications are used to 


calculate the model selection proportion. Table [2] lists the design details. 
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Table 2: Simulation Study Design 


Distribution Missingness Sample Size Class 
Study Model ei? m? Depends on Separation!! 


NPN t COX ES? Y™ Different M S 


Study 1 Normal LGCMs: use relative small sample sizes due to single-class 
data 


NN-ignorable v "4 
NN-XI v vV 
NN-XS! Vv Vv 
NN-XY Vv Vv 


v 


Study 2 Robust LGCMs: use relative small sample sizes due to single-class 
data 


TN-ignorable 
TN-XI 
TN-XS 
TN-XY 


KSSS 


TT-ignorable 
TT-XI 
TT-XS 
TT-XY 


SSS NSSS 


NT-ignorable 
NT-XI 
NT-XS 
NT-XY 


SRR A RC 


NN-ignorable 


"4 
NN-XI V 
Vv 


BSR RR RAGS UR EO o ce ode 


v 
v 
v 
v 
v 
v 
v 
v 


v 


Study 3 Robust GMMs (RGMMs): use relative large sample sizes due to 
multiple classes data, and use small class separation 
due to fixed class probabilities 


'TN-ignorable Vv V V 
TN-XI Vv Vv Vv 
TN-XS Vv V vy V 
TN-XY Vv Vv Vv Vv 
''I-ignorable "4 v "4 "4 
TT-XI Vv Vv Vv Vv 
TT-XS Vv Vv vv Vv 
TT-XY Vv v v v V 
NT-ignorable v v v V 
NT-XI Vv Vv Vv Vv 
NT-XS v v V vy V 
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NT-XY Vv Vv v v " 
NN-ignorable v v "4 "4 
NN-XI Vv Vv Vv Vv 
NN-XS Vv Vv Mov v 
NN-XY V v v v v 


Study 4 Robust Extended GMMs (REGMMs): select 5 competing models 
based on the performance in Study 3 use relative large sample sizes 
due to multiple-class data and varied class probabilities 


TN-CXS Vv Vv Y v v 
TN-CX Vv Vv v v 
TT-CXS Vv VV Vv v v 
NN-CXS Vv Vv Vv Vv v v 
NN-CX Vv Vv Vv v v 


Study 5 Single-Class LGCMs vs. Multiple-Class RGMMs 
1 Class LGCMs 


TN-XS Vv "4 v 

TT-XS "4 "4 "4 "4 

NN-XS "4 "4 "4 "4 

2 Classes RGMMs 

TN-XS Vv "4 "4 v 
TT-XS "4 "4 "4 "4 v 
NN-XS "4 "4 "4 "4 v 
3 Classes RGMMs 

TN-XS Vv v v 

TT-XS "4 "4 "4 v 

NN-XS "4 "4 v v 

4 Classes RGMMs 

TN-XS Vv "4 v 

TT-XS "4 "4 "4 v 

NN-XS "4 "4 v "4 


Note. 1 The shaded model is the true model. 2 Measurement errors. 
3 Random effects. 4 Normal distribution. 5 t distribution. 
6 Latent class dependent (Non-ignorable). 7 Observed Covariates. 
8 Latent intercept dependent (Non-ignorable). 9 Latent slope 
dependent (Non-ignorable). 10 Potential outcome y dependent 
(Non-ignorable). 11 Class Separation 
: medium=2.7). 


when generating data (S: small=1.7, 


Study 1 investigated the performance of the Bayesian indices when data 
were non-mixture, homogeneous, normally distributed with non-ignorable 
missingness. The true model was NN-XS, which was the model with normally 
distributed measurement errors (e;) at level 1 and random effects (£;) at level 
2, with missingness depending on covariate x and latent slope S. Specifically, 
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e; ~ MN(0,L), n; ~ MN4(8,V) where 8 = (Intercept, Slope) = (1,3) and ¥ 
was a 2 by 2 symmetric matrix with Var(I) = 1, Cov(I, $) = 0, and Var(S) = 4. 
For missingness, the bigger the latent slope was, the higher the missing data rate 
would be. The missingness probit coefficients were set as yọ = (—1, —1, —1, — 1), 
Ye = (—1.5,—1.5, —1.5, —1.5), and yg = (0.5,0.5,0.5,0.5). For example, if a 
participant had a latent growth slope 3, with a covariate value 1, then his or 
her missing probability at each wave was 7 ~ 16%; if the slope was 5, with 
the same covariate value, the missing probability increased to r = 50%; but 
if the slope was 1, then the missing probability decreased to r = 2.3%. The 
covariate x was also generated from a normal distribution, z ~ N(1, sd = 0.2). 
In study 1, in total there were 16 conditions with 4 missingness mechanisms (XS 
non-ignorable, XY non-ignorable, XI non-ignorable, and ignorable) combined 
with 4 levels of sample size (1000, 500, 300, and 200). Table [B] lists the model 
selection proportions across 100 replications for each of these indices across all 
conditions in study 1. The largest proportion across 4 missingness models is 
indicated in the shaded cell for each index. When sample size is relatively large, 
1000 or 500, all of the model selection indices, except for the rough DIC (RDIC), 
correctly identify the true model with 100%. When sample size becomes smaller, 
300 or 200, except for the RDIC, all of the model selection indices choose the 
true model with certainty above 93%. Comparing the indices defined based on 
Dbar with those defined based on Dhat, one can see that the former performs a 
little bit better. 


Study 2 investigated the performance of these indices when data were 
non-mixture homogeneous with outliers and non-ignorable missingness. The 
main difference between study 2 and 1 was that the data for study 2 contain 
outliers such that they are not normally distributed. So robust growth curve 
models were used. The true model was TN-XS, which means measurement errors 
(e;) at level 1 followed a t-distribution. Specifically, e; were generated from a t 
distribution with 5 degrees of freedom and a scale matrix I, i.e., e; ~ Mt(0,I, 5). 
Other settings were kept the same as those in study 1. In this study, totally 32 
conditions were considered with 4 data distributions (NN, TN, NT, and TT), 
4 missingness patterns (XS non-ignorable, XY non-ignorable, XI non-ignorable, 
and ignorable), and 2 levels of sample size (1000 and 500). Tablel[4]lists the model 
selection proportions. The largest proportion across 16 missingness models is 
indicated in the shaded cells for each index. Except for the RDIC, all of the 
model selection indices correctly identify the true model. TT-XS is a competing 
model, which also gains high selection probabilities. This is because the normal 
distribution is almost identical to a t-distribution with large degrees of freedom. 
'The degrees of freedom of t is also estimated by the model. Also, the Dbar-based 
indices performs a little bit better than the Dhat-based indices. Among them, 
Dbar-based BIC and CAIC perform best. 


Study 3 was designed for mixture data with outliers and non-ignorable 
missing data. As data were mixture, growth mixture models were used. In 
this study, the true model was 2-class mixture TN-XS RGMM. Only intercepts 
of these 2 classes were different, with 5 for class 1 and 1 for class 2. Other 
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'Table 3. Model Selection Proportion in Study 1 


N=1000 N=500 

Non-ignorable Ignorable Non-ignorable Ignorable 
Criteron! |NN-XS? NN-XY? NN-XI NN? |NN-XS NN-XY NN-XI NN 
Dbar.AIC i” 0.000 0.000 0.000 1 0.000 0.000 0.000 
Dbar.BIC 1 0.000 0.000 0.000 1 0.000 0.000 0.000 
Dbar.CAIC| 1 0.000 0.000 0.000 1 0.000 0.000 0.000 
Dbar.ssBIC| 1 0.000 0.000 0.000 1 0.000 0.000 0.000 
RDIC 0.013 0.000 0.987 0.000 0.038 0.000 0.962 0.000 
Dhat.AIC 1 0.000 0.000 0.000 1 0.000 0.000 0.000 
Dhat.BIC 1 0.000 0.000 0.000 1 0.000 0.000 0.000 
Dhat.CAIC| 1 0.000 0.000 0.000 1 0.000 0.000 0.000 
Dhat.ssBIC| 1 0.000 0.000 0.000 1 0.000 0.000 0.000 
DIC 1 0.000 0.000 0.000 1 0.000 0.000 0.000 
N=300 N=200 
Dbar.AIC | 0.98125 0.01875 0.000 0.000 0.975 0.025 0.000 0.000 
Dbar.BIC | 0.98125 0.01875 0.000 0.000 0.975 0.025 0.000 0.000 
Dbar.CAIC| 0.98125 0.01875 0.000 0.000 0.975 0.025 0.000 0.000 
Dbar.ssBIC| 0.98125 0.01875 0.000 0.000 0.975 0.025 0.000 0.000 
Rough DIC| 0.1125 — 0.000 0.8875 0.000 0.2 0.03125 0.76875 0.000 
Dhat.AIC | 0.95 0.05 0.000 0.000 | 0.9375 0.06875 0.000 0.000 
Dhat.BIC 0.95 0.05 0.000 0.000 | 0.9375 0.06875 0.000 0.000 
Dhat.CAIC| 0.95 0.05 0.000 0.000 | 0.9375 0.06875 0.000 0.000 
Dhat.ssBIC| 0.95 0.05 0.000 0.000 | 0.9375 0.06875 0.000 0.000 
DIC 1 0.000 0.000 0.000 0.98125 0.0125 0.00625 0.000 
Note. 


1. The definition of each index is given in Tab 


«n 


2. The shaded model is the true model. The model is normal-distribution-based 


with 


latent-slope-dependent missingness. 


3. The model is normal-distribution-based with potential-outcome-dependent 
missingness. 


4. 'The 


model 


missingness. 
5. The model is normal-distribution-based with ignorable missingness. 


6. 'The shaded cell has the largest proportion. 


is normal-distribution-based with 


latent-intercept-dependent 
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'Table 4. Model Selection Proportion in Study 2 


4T 


N=1000 N=500 
Non-ignorable Ignorable | Non-ignorable Ignorable 
Index XS? XY XI XS XY XI 
Dbar.AIC 'TN*|0.519 0.000 0.000 0.000 [0.597 0.013 0.000 0.000 
TT? |0.469 0.000 0.000 0.012 [0.377 0.000 0.000 0.000 
NT? [0.000 0.000 0.000 0.000 — 10.006 0.000 0.000 0.000 
NN*|0.000 0.000 0.000 0.000 — 0.006 0.000 0.000 0.000 
Dbar.BIC ‘TN /0.781 0.000 0.000 0.000 [0.855 0.013 0.000 0.000 
TT |0.200 0.000 0.000 0.019  /0.113 0.000 0.000 0.000 
NT /0.000 0.000 0.000 0.000 — 10.006 0.000 0.000 0.000 
NN |0.000 0.000 0.000 | 0.000 — 10.013 0.000 0.000 0.000 
Dbar.CAIC TN /|0.819 0.000 0.000 0.000 — |0.888 0.012 0.000 0.000 
TT |0.162 0.000 0.000 0.019 0.075 0.000 0.000 0.000 
NT /0.000 0.000 0.000 0.000 — 10.000 0.000 0.000 0.000 
NN /0.000 0.000 0.000 0.000 — 10.019 0.000 0.000 0.000 
Dbar.ssBIC TN |0.625 0.000 0.000 0.000 [0.631 0.012 0.000 0.000 
TT |0.362 0.000 0.000 0.012 |0.338 0.000 0.000 0.000 
NT [0.000 0.000 0.000 0.000 — 10.006 0.000 0.000 0.000 
NN /0.000 0.000 0.000 0.000 — 10.006 0.000 0.000 0.000 
RDIC TN |0.000 0.000 0.106 0.000 {0.000 0.000 0.094 0.000 
TT |0.000 0.000 0.100 0.000  /0.000 0.000 0.113 0.000 
NT /0.000 0.000 0.394 0.000 — 10.000 0.000 0.390 0.000 
NN |0.000 0.000 0.400 0.000 [0.000 0.000 0.403 0.000 
Dhat.AIC TN /|0.544 0.000 0.000 0.000 {0.547 0.025 0.000 0.000 
TT |0.506 0.006 0.000 0.000 |0.447 0.019 0.000 0.000 
NT /0.000 0.000 0.000 0.000 — 10.000 0.000 0.000 0.000 
NN /0.000 0.000 0.000 0.000 — 10.000 0.000 0.000 0.000 
Dhat.BIC TN /0.675 0.006 0.000 0.000  |0.717 0.025 0.000 0.000 
TT |0.319 0.000 0.000 0.000 0.245 0.013 0.000 0.000 
NT /0.000 0.000 0.000 0.000 — 10.000 0.000 0.000 0.000 
NN /0.000 0.000 0.000 0.000 — 10.000 0.000 0.000 0.000 
Dhat.CAIC TN |0.700 0.006 0.000 0.000 {0.788 0.025 0.000 0.000 
TT |0.294 0.006 0.000 0.000 0.169 0.012 0.000 0.000 
NT /0.000 0.000 0.000 0.000 — 10.000 0.000 0.000 0.000 
NN /0.000 0.000 0.000 0.000 — 10.000 0.000 0.000 0.000 
Dhat.ssBIC TN |0.575 0.006 0.000 0.000 |0.588 0.025 0.000 0.000 
TT |0.419 0.006 0.000 0.000 |0.369 0.012 0.000 0.000 
NT [0.000 0.000 0.000 0.000 — 10.000 0.000 0.000 0.000 
NN /0.000 0.000 0.000 0.000 — 10.000 0.000 0.000 0.000 
DIC TN |0.325 0.000 0.000 0.000 [0.415 0.006 0.000 0.000 
TT |0.462 0.000 0.000 0.194 |0.409 0.000 0.000 0.000 
NT [0.012 0.000 0.000 0.000 — 10.088 0.000 0.000 0.000 
NN /0.006 0.000 0.000 0.000 — 10.082 0.000 0.000 0.000 
Note. 1^ T-Normal, T-T, Normal-T, and Normal-Normal measurement errors and 


^g 


random effects. >Other abbreviations are as given in Table [3] 
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settings for each class were the same as in study 2. Both classes have ts 
distributed measurement errors. Based on (1962), the 
class separation is around 2.7. In this study, we assumed they are traditional 
mixture models, i.e., class probabilities were fixed at (50%, 50%) in this study. 
The same as in study 2, there were 32 conditions considered with 4 data 
distributions (NN, TN, NT, and TT), 4 missingness patterns (XS non-ignorable, 
XY non-ignorable, XI non-ignorable, and ignorable), and 2 levels of sample size 
(1000 and 1500). As mixture data require more data to obtain estimates, we 
increased the sample size. Table |5| shows the results for study 3. The shaded 
cell indicates the largest proportion across 16 missingness models for each index. 
Again, almost all of the model selection indices correctly identify the true model. 
And the Dbar-based indices perform a little bit better than the Dhat-based 
indices. Specifically, Dbar-based BIC and CAIC perform best among these 
indices, and then Dbar-based ssBIC also perform well. 

Study 4 extended study 3 such that the class probabilities were not 
fixed. Instead, they depended on values of covariates. Also, the non-ignorable 
missingness in this study was allowed to depend on the corresponding 
observations’ latent class membership. The true model in this study was 2-class 
mixture TN-CXS robust extended growth mixture models (REGMM). The 
differences between this study and study 3 were (1) the class proportions in this 
study were predicted by the value of covariate x; (2) the missing data rates were 
predicted by the latent class membership; and (3) both medium size, 2.7, and 
small size, 1.7, class separations were used. Specifically, for small class separation, 
the intercept for class 1 was 3.5 and the intercept for class 2 was 1. To simplify 
the simulation, based on the findings in study 3, 5 competing mixture models 
(TN-CXS, TT-CXS, TN-CX, NN-CXS, and NN-CX) were chosen to fit the data. 
Totally, we considered 20 conditions with 5 mixture models, 2 levels of sample 
size (1500 and 1000), and 2 levels of class separation (2.7 and 1.7). Table [6] 
shows the model selection proportions in study 4. Again, almost all of the model 
selection indices correctly identify the true model. Specifically, Dbar-based BIC 
and CAIC perform best among these indices. 

Study 5 focused on the number of classes. In this study, different growth curve 
models with different numbers of classes were fitted and compared. In total, 9 
conditions were considered, including 3 models (TN-XS, TT-XS, NN-XS) and 
3 numbers of classes (1, 2, and 3). The true model was the 2-class mixture 
TN-XS model. The simulation results for study 5 were presented in Table 
Among these indices, Dhat-based indices perform better than Dhbar-based 
indices. Specifically, Dhat-based BIC and CAIC perform best, and ssBIC and 
AIC also provide high certainty. 
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'Table 5. Model Selection Proportion in Study 3 


N=1500 


N=1000 


Non-ignorable 


XS XY XI 


Ignorable 


Non-ignorable 


XS XY XI 


Ignorable 


Index 
Dbar.AIC  TN|0.621 0.000 0.000 0.000 0.593 0.000 0.000 0.000 
''T' 0.357 0.000 0.000 0.000 |0.314 0.000 0.000 0.000 
N'T/0.000 0.000 0.000 0.000 |0.021 0.000 0.000 0.000 
NN/0.021 0.000 0.000 0.000 |0.071 0.000 0.000 0.000 
Dbar.BIC  TN|0.864 0.000 0.000 0.000 /|0.843 0.000 0.000 0.000 
'T'T'/0.114 0.000 0.000 0.000 {0.064 0.000 0.000 0.000 
N'T/0.000 0.000 0.000 0.000 |0.014 0.000 0.000 0.000 
NN/0.021 0.007 0.000 0.000 {0.079 0.000 0.000 0.000 
Dbar.CAIC TN|0.893 0.000 0.000 0.000 (0.857 0.000 0.000 0.000 
''T 0.079 0.000 0.000 0.000 — 0.043 0.000 0.000 0.000 
NT/0.000 0.000 0.000 0.000 10.007 0.007 0.000 0.000 
NN/0.021 0.007 0.000 0.000 10.086 0.000 0.000 0.000 
Dbar.ssBIC TN|0.729 0.000 0.000 0.000 |0.750 0.000 0.000 0.000 
''T' 0.250 0.000 0.000 0.000 {0.157 0.000 0.000 0.000 
NT/0.000 0.000 0.000 0.000 A 0.014 0.000 0.000 0.000 
NN/0.021 0.007 0.000 0.000 10.079 0.000 0.000 0.000 
RDIC TN}0.071 0.000 0.000 0.000 — 0.143 0.000 0.000 0.000 
''T 0.086 0.000 0.000 0.000 — 0.071 0.000 0.000 0.000 
NT/0.450 0.000 0.000 0.000 10.393 0.007 0.000 0.000 
NN/0.393 0.000 0.000 0.000 (0.379 0.007 0.000 0.000 
Dhat.AIC  TN|0.586 0.000 0.000 0.000 0.621 0.000 0.000 0.000 
''T' 0.379 0.000 0.000 0.000 0.329 0.000 0.000 0.000 
NT/0.014 0.000 0.000 0.000 /0.014 0.007 0.000 0.000 
NN/0.014 0.007 0.000 0.000 10.057 0.000 0.000 0.000 
Dhat.BIC  TN|0.757 0.000 0.000 0.000 [0.793 0.000 0.000 0.000 
''T' 0.207 0.000 0.000 0.000 {0.121 0.000 0.000 0.000 
NT/0.007 0.000 0.000 0.000 0.007 0.007 0.000 0.000 
NN/0.021 0.007 0.000 0.000  |0.071 0.000 0.000 0.000 
Dhat.CAIC TN|0.757 0.000 0.000 0.000 /0.814 0.000 0.000 0.000 
TT 0.207 0.000 0.000 0.000 — |0.100 0.000 0.000 0.000 
NT/0.007 0.000 0.000 0.000 0.007 0.007 0.000 0.000 
NN/0.021 0.007 0.000 0.000 (0.071 0.000 0.000 0.000 
Dhat.ssBIC 'TN|0.586 0.000 0.000 0.000 |0.664 0.000 0.000 0.000 
TT}0.379 0.000 0.000 0.000 |0.250 0.000 0.000 0.000 
NT/0.014 0.000 0.000 0.000 0.014 0.007 0.000 0.000 
NN/0.014 0.007 0.000 0.000  |0.064 0.000 0.000 0.000 
DIC TN}0.507 0.000 0.000 0.000 (0.364 0.007 0.000 0.000 
'T'T' 0.371 0.000 0.000 0.000 |0.286 0.000 0.000 0.000 
NT/0.043 0.036 0.000 0.000 /0.129 0.029 0.007 0.000 
NN/0.043 0.000 0.000 0.000 10.150 0.029 0.000 0.000 


Note. Same as Table BI 
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'Table 6. Model Selection Proportion in Study 4 


Index TN-CXS TT-CXS NN-CXS TN-CX NN-CX 


Class Separation=2.7, N=1500 

Dbar. AIC 0.567 0.425 0.000 0.008 0.000 
Dbar.BIC 0.808 0.158 0.000 0.033 0.000 
Dbar.CAIC 0.850 0.108 0.000 0.0042 0.000 
Dbar.ssBIC 0.667 0.300 0.000 0.033 0.000 
RDIC 0.042 0.042 0.908 0.000 0.008 
[lpt] Dhat.AIC 0.475 0.392 0.000 0.133 0.000 
Dhat.BIC 0.550 0.233 0.000 0.217 0.000 
Dhat.CAIC 0.525 0.233 0.000 0.242 0.000 
Dhat.ssBIC 0.467 0.367 0.000 0.167 0.000 
DIC 0.467 0.500 0.033 0.000 0.000 
Class Separation=2.7, N=1000 

Dbar. AIC 0.558 0.375 0.000 0.067 0.000 
Dbar.BIC 0.750 0.125 0.000 0.125 0.000 
Dbar.CAIC 0.767 0.100 0.008 0.125 0.000 
Dbar.ssBIC 0.633 0.292 0.000 0.075 0.000 
RDIC 0.092 0.075 0.808 0.000 0.025 
[lpt] Dhat.AIC — 0.350 0.358 0.000 0.292 0.000 
Dhat.BIC 0.450 0.175 0.000 0.375 0.000 
Dhat.CAIC 0.442 0.150 0.000 0.4 0.008 
Dhat.ssBIC 0.392 0.300 0.000 0.308 0.000 
DIC 0.417 0.450 0.108 0.008 0.017 
Class Separation=1.7, N=1500 

Dbar. AIC 0.512 0.444 0.044 0.000 0.00 
Dbar.BIC 0.744 0.212 0.044 0.000 0.00 
Dbar.CAIC 0.781 0.175 0.044 0.000 0.00 
Dbar.ssBIC 0.612 0.344 0.044 0.000 0.00 
RDIC 0.306 0.238 0.350 0.006 0.10 
[lpt] Dhat.AIC 0.475 0.475 0.031 0.019 0.00 
Dhat.BIC 0.712 0.238 0.031 0.019 0.00 
Dhat.CAIC 0.712 0.238 0.031 0.019 0.00 
Dhat.ssBIC 0.475 0.475 0.031 0.019 0.00 
DIC 0.381 0.450 0.169 0.000 0.00 
Class Separation=1.7, N=1000 

Dbar.AIC 0.550 0.400 0.050 0.000 0.000 
Dbar.BIC 0.719 0.194 0.081 0.006 0.000 
Dbar.CAIC 0.750 0.162 0.081 0.006 0.000 
Dbar.ssBIC 0.638 0.300 0.062 0.000 0.000 
RDIC 0.244 0.256 0.362 0.000 0.138 
[lpt] Dhat.AIC 0.694 0.231 0.012 0.062 0.000 
Dhat.BIC 0.644 0.294 0.012 0.050 0.000 
Dhat.CAIC 0.694 0.231 0.012 0.062 0.000 
Dhat.ssBIC 0.575 0.388 0.012 0.025 0.000 
DIC 0.344 0.331 0.319 0.000 0.006 


Note. Same as Table [3] 
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Table 7. Model Selection Proportion in Study 5 


2 CLASSES 1 CLASS 3 CLASSES 

Index TN-XS TT-XS NN-XS|TN-XS TT-XS NN-XS TN-XS TT-XS NN-XS 
Dbar.AIC 0.000 0.000 0.057} 0.393 0.129 0.000} 0.021 0.007 0.393 
Dbar.BIC 0.000 0.000 0.036} 0.821 0.064 0.000} 0.000 0.000 0.079 
Dbar.CAIC,| 0.000 0.000 0.036} 0.864 0.043 0.000} 0.000 0.000 0.057 
Dbar.ssBIC| 0.000 0.000 0.057} 0.593 0.100 0.000) 0.000 0.000 0.25 
RDIC 0.036 0.014 0.2, 0.014 0.014 0.679) 0.014 0.014 0.014 
Dhat.AIC 0.621 0.343 0.064) 0.000 0.000 0.000} 0.000 0.000 0.000 
Dhat.BIC 0.793 0.136 0.071} 0.000 0.000 0.000} 0.000 0.000 0.000 
Dhat.CAIC| 0.814 0.114 0.071} 0.000 0.000 0.000) 0.000 0.000 0.000 
Dhat.ssBIC| 0.664 0.264 0.071) 0.000 0.000 0.000) 0.000 0.000 0.000 


DIC 0.000 0.000 0.000} 0.164 0.193 0.121) 0.000 0.000 0.521 
Note. Same as Table 


5 Application 


In this section, a real data set on mathematical growth is analyzed to 
demonstrate the application of the indices. The same sample that has been 


analyzed in (2011) is used here. It is a mathematical ability 
growth sample from the NLSY97 survey (Bureau of Labor Statistics, U.S. 


1997), which were collected from N = 1510 adolescents 
yearly from 1997 to 2001 when each adolescent was administered the Peabody 
Individual Achievement Test (PIAT) Mathematics Assessment to measure their 
mathematical ability. There are some outliers at all five grades. 
conducted a power transformation to normalize the sample and assumed the 
data are normally distributed without outliers. In this study, however, we use 
the original non-transformed data with outliers, but robust methods are used. 
Also, different non-ignorable missingness mechanisms are considered. Overall, 
the means of mathematical ability increased over time with a roughly linear 
trend. The missing data rates range from 4.57% to 9.47%, and the raw data 
show the missing pattern is intermittent. About half of the sample is female. 
The analysis is conducted following the steps in Table|8} In step 1, a tentative 
model (the TT-ignorable model) is fitted to the data. Gender is a covariate. 
The estimates of degrees of freedom of t for both classes are 2.342 and 3.263 
for measurement errors and 75.65 and 50.96 for random effects, which indicates 
that measurement errors can be better fitted using a t distribution while random 
effects are approximately normally distributed (i.e., a TN model). And then 
in step 2, to compare models with different non-ignorable missingness and 
numbers of classes, 10 models are fitted to the data. During estimation we 
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'Table 8. Steps and Fitting Models in Real Data Analysis 


Step 1: Fit a tentative 2 classes model, and check 
the estimated df of t 


ei N; missingness 
Model NT NTCXIS Y 


TT-ignorable v "4 


Step 2: Try models with different missingness and 
number of classes 


2 Classes RGMMs 


TN-X Vv "4 

'TN-XI Vv Vv 
TN-XS Vv vv 
TN-XY Vv "4 "4 
2 Classes REGMMs 

TN-CX Vv Vv 

TN-CXI Vv VV Vv 
TN-CXS Vv Vv Vv 
TN-CXY vv Vv "4 
3 Classes GMMs 

NN-X "4 "4 "4 

4 Classes GMMs 

NN-X "4 "4 "4 


Step 3: Compare selection indices 


Step 4: Interpret results obtained from the selected 
model 


Note. Abbreviations are as given in Table DI 
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Table 10. Estimates of TN-CXY REGMM in Real Data Analysis 


Parameter Mean S.D. MC.e./S.D.! Lower? Upper? Geweke z* 
Intercept 8.647 0.037 0.026 8.572 8.717 0.007 
Slope 0.229 0.009 0.023 0.211 0.247 0.014 
E P Var(I) 0.234 0.028 0.024 0.183 0.293 -0.009 
© E Var(S) 0.014 0.002 0.018 0.011 0.017 0.004 
8 O Cov(I1, S) -0.036 0.006 0.022 -0.049 -0.026 -0.005 
d Var(e) 0.044 0.004 0.031 0.037 0.053 0.024 
9 dfy” 2.386 0.205 0.043 2.118 2.900 0.050 
5 Intercept 6.196 0.047 0.020 6.103 6.287 0.054 
O Slope 0.315 0.011 0.022 0.295 0.336 0.036 
S j^ Var(I) 1.326 0.084 0.017 1.167 1.497 0.020 
3 É: Var(S) 0.034 0.004 0.022 0.027 0.042 0.010 
O OOCov(I,S) 0.010 0.014 0.021 -0.018 0.037 -0.023 
Var(e) 0.372 0.020 0.033 0.336 0.412 -0.061 
dfy 3.200 0.195 0.040 2.850 3.600 -0.042 
A 10? -0.214 0.119 0.051 -0.438 0.018 -0.039 
O vu -0.223 0.077 0.051 -0.372 -0.076 0.026 
o yo -0.711 0.532 0.066 -1.843 0.204 -0.255 
€ vie -0.132 0.216 0.058 -0.527 0.310 0.231 
E dai? -0.154 0.108 0.046 -0.368 0.058 0.008 
yy, -0.087 0.059 0.065 -0.190 0.038 0.251 
oo V02 -1.157 0.446 0.064 -2.097 -0.447 -0.373 
p d» 0.046 0.217 0.055 -0.345 0.489 0.347 
E E 0.113 0.114 0.046 -0.109 0.334 0.032 
E v2 -0.108 0.045 0.062 -0.188 -0.021 0.330 
E o 03 -0.613 0.454 0.065 -1.519 0.163 -0.462 
a Sis -0.057 0.181 0.056 -0.403 0.292 0.381 
z & ^/z3 -0.147 0.094 0.046 -0.332 0.038 0.045 
£ z Vv3 -0.074 0.045 0.064 -0.155 0.022 0.459 
S 04 -0.032 0.512 0.066 -0.861 0.985 -0.426 
2 "ria -0.324 0.204 0.059 -0.732 0.029 0.362 
8 Yona 0.059 0.101 0.047 -0.142 0.251 0.128 
0 YA -0.166 0.050 0.065 -0.266 -0.084 0.378 
J 0s -1.298 0.421 0.065 -2.130 -0.442 -0.192 
2 yis 0.341 0.176 0.055 0.015 0.708 0.159 
© Yas -0.087 0.091 0.045 -0.263 0.083 0.001 
v5 -0.019 0.040 0.064 -0.092 0.062 0.189 

1 Ratio of MC error to standard deviation. A value around or less than 0.05 


indicates that the corresponding estimate is accurate (Spiegelhalter, Thomas, 
Best, & Lunn, 2003). 


2-3 The lower 2.5 percentile and upper 97.5 percentile. 


4 


5 
6 
7 
8 
9 
1 


Geweke test z value. An absolute value less than 1.96 indicates that the 
corresponding chain has passed the convergence test. 

The degrees of freedom of the multivariate-t. 

The probit coefficient of the class probability for class 1, defined in u 
The probit coefficient of the class membership 1 at Grade 7, defined in Eqn. (9). 
The probit coefficient of the class membership 2 at Grade 7, defined in Eqn. (9). 
The probit coefficient of the covariate at Grade 7, defined in rii 


0 The probit coefficient of the potential output Y at Grade 7, defined in Eqn. (9). 
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use uninformative priors which carry little information for model parameters. 
A burn-in period is run first to ensure estimates are based on the Markov chains 
that have converged. For testing convergence, the history plot is examined and 
the Geweke's z statistic is checked for each parameter. The 
Geweke's z statistics for all the parameters are smaller than 1.96, which indicates 
converged Markov chains. To make sure all the parameters are estimated 
accurately, the next 50, 000 iterations are then saved for data analysis. The ratio 
of Monte Carlo error (MCerror) to standard deviation (S.D.) for each parameter 
is smaller than or close to 0.05, which indicates parameter estimates are accurate 
(Spiegelhalter, Thomas, Best, & Lunn] 2003). In step 3, model selection indices 
are used to compare the ten models. The indices are listed in Table [9] And in 
step 4, the results obtained from the final selected model are interpreted. 

As suggested by Dhat.CAIC, Dhat.ssBIC, Dhat.BIC, and Dhat.AIC, without 
further substantive information, the TN-CXY model appears to be a good 
candidate for the best-fitting model. Table[10|provides the results of the TN-CXY 
REGMM model. It can be seen that (1) class 1 has a higher average initial level 
but a smaller average slope; (2) class 2 has larger variations for initial levels and 
slope; (3) the residual variance of class 2 is much larger than that of class 1; (4) in 
class 1 the initial level and the slope are significantly negatively correlated at the 
confidence level of 9596; (5) the missingness is not related to gender because none 
of the coefficients of gender are significant at the a level of 0.05; (6) at grade 11, 
adolescents in class 2 are more likely to miss tests than those in class 1 because 
the probit coefficient of class membership for grade 11 is significantly positive; 
and (7) at grades 8 and 10, students with higher potential scores are more likely 
to miss tests than the students having lower scores because the probit coefficients 
of the potential outcomes y at the two grades are significantly negative. 


6 Conclusions, Discussion and Future Research 


Based on the results from the five simulation studies, one can conclude that (1) 
almost all of the model selection indices, except for the rough DIC (RDIC), 
can correctly choose the true model with high certainty; (2) if the number 
of classes is correctly identified, then the Dbar-based indices perform better 
than the Dhat-based indices; if candidate models have different numbers of 
classes, then the Dhat-based indices might be used to select the best fit model; 
(3) across 5 studies, CAIC and BIC provide higher probabilities than those 
ssBIC, AIC, or DIC does. The results will help inform the selection of growth 
models by researchers seeking to provide people with accurate estimates of 
growth across a variety of possible contexts. The real data analysis demonstrated 
the application of the indices to typical longitudinal growth studies such as 
educational, psychological, and social research. 

'This study can be extended in many ways. For example, different versions 
of the likelihood function or more model selection indices can be studied and 
compared by using more practical statistical models. (1) As we stated in 
the section of Introduction, there are at least three challenges in proposing 
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new selection indices. The third challenge is about the likelihood function 
I(y|9). When latent variables involved, the likelihood can be an observed-data 
likelihood, a complete-data likelihood, or a conditional likelihood 
[2006]. In this study, we use a conditional joint loglikelihood, but in the future, 
the other versions of likelihood functions can be investigated. (2) Another future 
research of this study is to propose other model selection indices, such as Bayes 
factors. (3) This study focuses on latent growth models only. In the future, the 
performance of these selection indices can be studied by using other statistical 
models, such as survival models. 
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Abstract. In psychological research, class imbalance in binary outcome 
variables is a common occurrence, particularly in clinical variables (e.g., 
suicide outcomes). Class imbalance can present a number of difficulties 
for inference and prediction, prompting the development of a number 
of strategies that perform data augmentation through random sampling 
from just the positive cases, or from both the positive and negative cases. 
Through evaluation in benchmark datasets from computer science, these 
methods have shown marked improvements in predictive performance 
when the outcome is imbalanced. However, questions remain regarding 
generalizability to psychological data. To study this, we implemented 
a simulation study that tests a number of popular sampling strategies 
implemented in easy-to-use software, as well as in an empirical example 
focusing on the prediction of suicidal thoughts. In general, we found that 
while one sampling strategy demonstrated far worse performance even in 
comparison to no sampling, the other sampling methods performed simi- 
larly, evidencing slight improvements over no sampling. Further, we eval- 
uated the sampling strategies across different forms of cross-validation, 
model fit metrics, and machine learning algorithms. 


Keywords: Imbalanced data - Sampling strategies - Machine learning 


1 Introduction 


In psychological research, class imbalance in binary outcome variables (also re- 
ferred to as skew or rare events), most often occurs due to the underlying pop- 
ulation of interest having small proportions of individuals with positive cases 
(minority), such as in the case of study designs that are assessing the prevalence 
of suicidal attempts in the general population. While class imbalance can be 
dealt with through changes in study design, such as sampling among individu- 
als with a history of mental illness to increase the probability of observations 
having a history of suicide attempts, this can fundamentally alter the alignment 
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between the population of interest and the sample of which the data is collected 
from. 


In the presence of class imbalance, failure to utilize appropriate strategies 
has a number of consequences. For instance, even when explanation is the pri- 
mary aim, using logistic regression with skewed outcomes can result in under- 
estimated probabilities for the positive class [2001]. Additionally, 
one of the general strategies is to perform data augmentation through random 
sampling from just the positive cases, or from both the positive and negative 


cases. [Kovács] (2019) found that in general, any form of sampling improves upon 
the performance modeling the original dataset. 


However, even in areas of psychological research where imbalanced outcomes 
are extremely common, such as suicide, the use of sampling strategies is still ex- 
tremely rare (e.g., a recent tutorial on evaluating classification in suicide research 
does not mention sampling strategies; Mitchell, Cero, Littlefield, & Brown] 2021). 
While part of this lack of translation across disciplines may be due to relatively 
siloed research, it also may partially be attributed to a lack of generalizability 
in the findings. While a large number of studies have evaluated the relative per- 
formance of methods designed to overcome class imbalance, the vast majority of 
research focuses on evaluation in benchmark datasets with characteristics unique 
to that field of study (primarily computer science), which limits the generaliz- 
ability of these findings to areas with different types of data including psychology. 
'This is similar to the hype and promise of machine learning being somewhat di- 
luted by limitations to the data commonly found in psychological research (see 


Jacobucci & Grimm| 2020). 


'Thus, the goal of this study is to answer the question: Do minority case sam- 
pling approaches improve prediction with imbalanced outcomes in datasets with 
psychological variables? To accomplish this, we evaluated a number of sampling 
strategies commonly used for imbalanced outcomes in simulated data that is 
more in line with characteristics commonly found in psychological research. We 
followed this by applying those strategies to the prediction of suicidal ideation in 
a large public dataset. Additionally, we specifically focus on strategies for over- 
coming class imbalance that are already implemented in easy-to-use software. 
Our focus is on strategies that operate at the data level, as opposed to the model 
estimation phase. While we test two different algorithms, logistic regression and 
random forests, we put our focus on methods that do not involve the use of mis- 
classification costs for a number of reasons. The first is that sampling methods 
are easy to implement in easy-to-use software that pairs with many ML algo- 
rithms, meaning researchers won't face limitations with which algorithms can 
be compared. Further, sampling based methods have been studied more (e.g., 
thus often show up more 
in recommendations. And finally, assigning costs does not overcome potential 
issues of few to no positive cases being represented when the sample size is small 
and k-fold cross-validation (CV) is used. 
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1.1 Sampling Methods 


While a number of strategies have been proposed for supervised learning with 
imbalance data, possibly the two simplest are random over-sampling (OVER) 
and random under-sampling (UNDER). While OVER random samples from the 
minority case to produce an equal distribution of positive and negative cases, 
UNDER randomly removes majority cases to produce an equal distribution. 
As an example, of an original dataset with 10 positive cases and 100 negative 
cases, OVER would produce a new dataset with 100 positive and negative cases 
each, while UNDER would create a dataset with 10 positive and negative cases 
each. Both methods have well understood drawbacks: while UNDER discards 
potentially useful data, OVER increases the probability of overfitting 


Zabar, & Weiss| 2005). 


1.2 Synthetic Minority Over-Sampling Technique (SMOTE) 


SMOTE (Chawla, Bowyer, Hall, & Kegelmeyer||2002) is an oversampling tech- 


nique that creates artificial minority class cases by using the k-nearest neighbors 
for a given minority case instead of oversampling randomly as in OVER. More 
specifically, for a specific minority case i, instead of just creating duplicates of 
that case, the SMOTE algorithm finds k similar minority cases to case i, and 
generates synthetic cases that take on a value for the predictor variables that 
represent a blend of the k-nearest neighbors. Thus, the newly created synthetic 
minority cases contain similar, not identical, predictor values to the k-nearest 
neighbors. Finally, the number of synthetic cases created for each minority case 
is a tuning parameter typically referred to as Npercent- 


1.3 Random Over-Sampling Examples (ROSE) 


While SMOTE only generates synthetic samples from the positive cases, ROSE 
uses the smoothed bootstrap to generate both negative 
and positive samples to create a new dataset that is more balanced. In the ROSE 
procedure, an observation is first drawn with a 5096 chance of belonging to each 
class. Given this observation, a new sample is then generated in its neighborhood 
(according to the predictor values), with the neighborhood chosen according to 
a kernel density estimate (for further detail, see[Menardi & Torelli} [2012). The 
user of ROSE is given discretion as to how much under-sample the negative 
cases, and to what degree over-sample the positive cases. 


Demonstration To demonstrate how SMOTE and ROSE randomly over sam- 
ple minority cases, we simulated 50 cases according to a linear logistic model 
with two predictors (regression coefficients of 0.2 and 0.4) and an intercept of -3, 
which resulted in 47 negative cases and 3 positive cases. T'his was followed by 
applying each method with the DMwR package and ROSE package 


(Lunardon, Menardi, & Torelli| 2014). The resulting datasets are displayed in 


Figure 
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Figure 1: The top plot is the original dataset, the middle figure is after applying 
SMOTE, while the bottom figure is after applying ROSE. 


Imbalance and Sampling 63 


We can see in Figure }1| that both methods oversample the positive cases, 
generating new data points based on the original positive cases. Additionally, 
the two methods handle the negative cases in different ways: SMOTE uses a hy- 
perparameter (we selected 500%) to select the percentage of negative cases that 
are selected relative to the number of positive cases generated; ROSE requires 
setting a hyperparameter to determine the percentage of positive cases that end 
up in the new dataset (defaults to 0.5), with the negative cases making up the 
remainder to have the sample size equal to the original sample. 


1.4 Comparison of Sampling Procedures 


SMOTE has been widely applied across an assortment of research domains. 
Evaluating its performance in applied data is complicated by a number of factors, 
but one primary concern is on the differences in which classification performance 
metrics are reported. For instance, [Nnamoko and Korkontzelos| used data 
with diabetes diagnosis as the outcome, but mainly focused on improvements 
due to SMOTE in accuracy. However, the calculation of accuracy is based on 
the class distribution, which is discrepant across the various ways that Nnamoko 
and Korkontzelos evaluated the use of SMOTE. Most studies have primarily 
focused on evaluating the area under the receiver operator characteristic (ROC) 
curve (AUC; [Hanley & McNeil} (1982), or the area under the precision-recall 
curve (AUPRC; 2009). There exist multiple additional methods 
that incorporate similar types of information (e.g., see 
[2015], however, these two have received the most coverage. Whereas the AUC 
encompasses the contrast between sensitivity and specificity, thus information 
regarding both classes, the AUPRC contrasts recall with precision, thus only 
encodes information regarding positive cases. While the AUC is more commonly 
used in practice, there are concerns regarding the AUC being misleading in the 


presence of imbalance (Lobo, Jiménez-Valverde, & Real| |2008), with findings 


that the AUPRC is more informative when classes are imbalanced 
[2015]. Therefore, we will focus on both the AUC and AUPRC, but 
give preference to the AUPRC. 

An additional complication in evaluating methods for handling imbalance is 
the research domain of concern. Many studies that evaluate methods for handling 
imbalance use benchmark datasets from that research area. For instance, a recent 
study by examined the bloom of cyanobacteria in rivers in 
South Korea. With this data, the researchers found differences in performance 
among various classifiers (e.g., ensembles outperformed single models), but only 
marginal performance gains in the application of SMOTE. Notably, they did not 
evaluate the AUPRC. In a different area of application, Zhu, Baesens, and vanden] 
examined class imbalance strategies in the area of customer 


churn prediction, evaluating performance in terms of the AUC, and comparing 
ensemble methods paired with various sampling strategies and cost-sensitive 
learning. Again, ensemble methods outperformed simpler algorithms, however, 
there did not seem to be a benefit to more complex sampling strategies above and 


beyond over or under-sampling. Finally, Demir and Sahin] (2022) examined the 


64 R. Jacobucci, X. Li 


impact of classification algorithms and oversampling methods for soil liquefaction 
evaluation, finding that SMOTE outperformed both OVER and ROSE. 


1.5 General Findings 


Among relatively simple strategies for handling class imbalance, over-sampling 


is typically preferred over under-sampling (e.g.,|Batista, Prati, & Monard| |2004 
Buda, Maki, & Mazurowski||2018). In simulated and benchmark datasets, 


(2020) compared under-sampling, over-sampling, and a hybrid of both to 
just the use of the original dataset, confirming prior research that over-sampling 
outperforms under-sampling. 

A further complication in this is that originally proposed over-sampling meth- 
ods can be subject to different interpretation, resulting in varying implementa- 
tions. To address this, tested four 
possible variants on the original SMOTE implementation, along with more re- 
cently proposed generalizations of SMOTE. On a number of benchmark datasets, 
they found that all of the variants outperformed random over-sampling and 
no sampling, with the highest performance attributed to the recently proposed 
Weighted-SMOTE por?) 

A recent study attempting to provide benchmark performance metrics for 
a host of recently proposed advancements on a large number of benchmark 
datasets, evaluated with multiple ML algorithms with repeated k-fold cross- 
validation [2019]. They found that the biggest improvements were at- 
tributed to the use of any reliable over-sampling method over no sampling, with 
much smaller improvements due to the use of the best performing methods over 
standard SMOTE oversampling. However, importantly for our purposes, this 
study did not test random over-sampling, and like most other studies, used a 
large number of benchmark datasets. 

One key piece in applying over-sampling is to ensure that augmented datasets 
are not created prior to splitting the dataset up into training and tests sets, as 
this can lead to overly-optimistic performance due to data leakage (e.g., 
[dewiele et al.|[2021). 'This can be attributed to copies, either exact or very similar, 
of original cases being included in both the training and tests sets. In R package 
caret [2008], the resampling is conducted inside of cross-validation or 
bootstrap sampling. As an example, in 5-fold CV, each partition that is created 
with 4/5ths of the sample is then subject to over-sampling, the model is trained, 
then tested on the 1/5th sample that was not subject to sampling. However, of 
note, if one has a true test set that is only used for assessing the final model's 
performance, sampling should not be conducted in this sample, as it is used only 
to test a previously trained model. 

Finally, much less research has focused on the interaction between the use 
of sampling and the actual sample size of the dataset. Studying this interaction 
is further complicated by the form of resampling used to evaluate prediction 
performance, as the most commonly used form, k-fold CV, has been shown to 


produce highly biased results in small samples (Vabalas, Gowen, Poliakoff, & 


2019). However, the presence of a binary outcome further complicates 
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defining what sample size is given that assessing the number of positive cases 
is more informative than the overall number of cases (cf. Peduzzi, Concato, 


Kemper, Holford, & Feinstein| |1996). 


2 Study 1 


2.1 Methods 


We specifically chose this study design as we believe that it mimics the structure 
found in the majority of clinical data that primarily includes self-report data. 
While we assess the influence of nonlinear relationships, we primarily simulate 
the data according to a linear model, as this is most in line with the results that 
utilize machine learning algorithms in clinical self-report data: if machine learn- 
ing outperforms linear models, the improvement in performance is most often 
negligible (Christodoulou et al] 
2021). For the simulation setup, we started by simulating standard 
normally distributed data with a sample size of 50,000. The cases not selected to 
train and test the methods were kept in order to produce performance metrics 
that serve as ground truth. 

In predictive tasks with class outcomes, there are often two layers of assess- 
ment. The first step in prediction-oriented tasks is often assessing the correspon- 
dence between the predicted probabilities and actual class labels, while further 
performance assessment can be taken in translating the predicted probabilities 
to predicted class labels to classify individuals. Given that much of psychological 
research is only focused with the first step, our aim is only assessing prediction 
performance. We interpret performance with respect to the AUC and AUPRC. 
While the AUPRC is more informative at higher degrees of imbalance, the AUC 
is much less likely to evidence floor effects, thus improving our ability to charac- 
terize its distribution. When making specific comparisons in performance across 
methods, we used an ANOVA with Tukey's HSD posthoc tests. 

With this setup, we varied a number of conditions across 200 repeats: To 
train and test the methods we tested sample sizes of 300, 1000, and 10,000. 
With these sample sizes, we simulated data following a logit link, while varying 
the intercept (bo) to control the level of class imbalance. We specifically tested 
values of -4 (= 0.02 positive), -3 (~ 0.05 positive), -2 (e 0.12 positive), -1 (= 0.27 
positive), and 0 (= balanced case). We tested the inclusion of 30 and 70 pre- 
dictors, with 1096 of the predictors having standardized coefficients of 0.2, 1096 
having 0.1, 1096 having 0.05, and the rest 7096 having coefficients of 0. Addition- 
ally, we added two standard normal predictors with unit weighted cosine and 
sine relationships with Y, and a tanh interaction between these variables with 
coefficients of 0.1. Although the exact functional form of these nonlinear relation- 
ships is unlikely to occur in psychological data, our focus is less on identifying 
the true relationships and more on determining whether nonlinearity interacted 
with imbalance to bias our model performance or algorithm selection. Following 
the logistic model, we also tested residual variances of 0.82 and 0.3. Once this 
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normally distributed version of Y was created, it was transformed into a prob- 
ability according to a logit link to a probability, followed by using a binomial 
distribution to generate values of 0 and 1. Finally, we included Bayesian logis- 
tic regression (Bayesian as this resulted in fewer convergence issues when the 
class imbalance was large and sample size small) and random forests 
[2001]. Our goal in this comparison was to test the potential of underfitting and 
overfitting, particularly given prior findings regarding overfitting with ROS. Our 
specific point of comparison is in assessing the performance of random forests 
with sampling to determine whether inflated AUC or AUPRC values are found. 

Outside of the simulated intercept values, our main point of evaluation con- 
cerned the resampling and sampling methods. For resampling, we tested the 
validation set approach or 10-fold CV. The validation set approach applied the 
sampling method on a training set that contained 7096 of N, followed by testing 
on a holdout set containing 30% of N that was not used to train the model. 
'The 10-fold CV approach used the sampling approach on each training set for 
each 10 iterations. Note that for both approaches the sampling method was used 
after splitting the sample and was not applied to the holdout set. Finally, our 
goal in assessing sampling methods was to test methods that are easy to apply 
in commonly used software. Given this, we focused on the methods available in 
the caret package in R. This included no sampling, UNDER, 
OVER, SMOTE, and ROSE. SMOTE is implemented in the DMwR package 


(Torgo| 2010), while ROSE is implemented in the ROSE package (Lunardon et; 


2014). We used the software defaults for both SMOTE and ROSE. 


2.2 Results 


Of the 200 replications across conditions, errors occurred in estimating models 
in a subset of conditions, namely due to the condition with a sample size of 300 
and intercept of -4 (11% errors). All other conditions had less than 2% errors. 
Given the breadth of results, we chose only to present a select subset of findings 
that highlight key points. 


Intercept and variability Possibly the largest influence of class imbalance is 
on the degree of variability to the AUC and AUPRC. This can be clearly seen in 
Figure[] With an intercept of -4, all of the sampling methods evidence a number 
of outliers that are strongly positively biased. However, this represented a quite 
small number of results, as the 95th percentile at an intercept of -4 was 0.13 for 
OVER and SMOTE. 

Additionally, we can see the median AUC and AUPRC values for the sam- 
pling methods get closer to the True performance as the imbalance becomes 
less. This is further influence by sample size and can be attributed to a lack of 
information when there are fewer positive cases, leading to further degrees of 
underfitting by default. As an example, when the intercept is -4, there is a dif- 
ference in AUC means of 0.06 between OVER and the True performance (0.596 
vs. 0.660), while for an intercept of 0 it is 0.04 (0.648 vs. 0.688). 
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Figure2: AUPRC and AUC values across sampling methods and simulated in- 
tercept. 
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Lastly, while the AUC is commonly labeled as biased in the presence of 
imbalanced data, it demonstrated similar performance across values of bọ as the 
AUPRC. Performance improves as the class distribution becomes more equal; 
this is most likely attributable to the increasing numbers of positive cases. In 
fact, when the intercept was -4, the AUC was on average 0.09 points higher when 
the sample size was 10,000 as opposed to 300, while the discrepancy fell to 0.06 
when the intercept was 0. 

We see similar effects with sample size in Figure |3| as one would expect. 
Larger sample size resulted in less variability, which further reduced the propen- 
sity to over- or under-estimated performance. For the AUC |'| the standard 
deviations were 0.10 for 300, 0.07 for 1,000, and 0.07 for 10,000. While a large 
number of AUC values for each of the sampling methods were greater than the 
True values in the smaller sample, the median values became closer to the True 
median scores in the larger sample sizes. This highlights improvements in per- 
formance and stability with greater N, and a worrisome level of biased outliers 
at small sample sizes. 
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Figure3: Sample size and AUC values. Note that we don't also depict the 


AUPRC as the behavior is the same, but the variance is overly wide due to 
averaging over the intercept values. 


1 We don’t report the standard deviations for the AUPRC as there were floor effects. 
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k-fold CV reduces variability Figure[4]displays the AUC values across both 
k-fold CV and validation set strategies, as well as using random forests and 
logistic regression. The first thing to note is the differences in variability across 
resampling methods, with k-fold CV having lower variance, particularly at a 
sample size of 300. This is in line with general recommendations to only use the 


validation set strategy in the presence of large sample sizes (i.e.,|James, Witten, 
Hastie, & Tibshirani||2013). Secondly, there do not seem to be mean differences 


across resampling methods, and only slight improvements due to random forests 
(as expected). Finally, we can see a strange interaction between the use of ROSE 
with k-fold CV, resulting in markedly worse performance. 
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Figure 4: Sample size and AUC values. Note that the AUPRC is not depicited as 
the behavior is the same, but the variance is overly wide due to averaging over 
the intercept values. 


Summary There was no form of sampling that resulted in universal best per- 
formance, but some general trends emerged. In line with prior research, some 
form of sampling generally outperformed no sampling. For both the linear and 
nonlinear simulated data, OVER had higher AUC values by 0.01 across condi- 
tions, which held even when imbalance was at its highest (bọ = —4) and a sample 
size of 300. Further, OVER and SMOTE consistently performed the best, with 
no statistical difference in their results averaged across conditions. For instance, 
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when the intercept was -4, OVER had a median AUPRC of 0.057, SMOTE of 
0.058, while no sampling was 0.052. It is important to note however that there 
was more variability within each type of sampling than between methods. Of 
note, there was not a notable distinction in the amount of performance variance 
across sampling methods: unfortunately, all sampling methods evidenced a large 
degree of variability when the sample size was small (Figure or there was 
high class imbalance (F igure B). Finally, ROSE performed the worst among the 
sampling methods, which was primarily due to problems in integrating ROSE 
with logistic regression and k-fold CV as seen in Figure [4] 

A surprising finding was that there were very little differences in the variabil- 
ity between the use of logistic regression and random forests. Random forests 
performed better than logistic regression, as expected given the two nonlinear 
effects, but importantly random forests did not have a greater propensity to 
overfit than logistic across the conditions and sampling methods. 


3 Study 2 


Data for Study 2 comes from the National Survey on Drug Use and Health 
from 2014 (NSDUH; [n.d.]. This survey focused on 
assessing the use of illicit drugs, alcohol, and tobacco among U.S. civilians 12 
years or older. For the purpose of our analysis, we focused on questions that 
assessed mental health issues. With a sample size of 55,271 and 3,148 variables, 
the dataset was pared down from the original dataset to just include thirty-nine 
predictors with the aim of predicting suicidal ideation (last 12 months; SUIC- 
THINK). Predictors included symptoms of depression and other mental health 
disorders, the impact of these symptoms on daily functioning, and four demo- 
graphic variables (gender, ethnicity, relationship status, age; dummy coded). The 
dataset can be freely downloaded from https: / /www.datafiles.samhsa.gov /study- 
dataset /national-survey-drug-use-and-health-2014-nsduh-2014-ds0001-nid16876. 

For the analysis, we used Bayesian logistic regression and random forests, 
while testing all of the above forms of handling imbalance. Secondly, we detail 
both the AUPRC and AUC given that the outcome variable had only 3.7% 
positive cases. Additionally, we separate the results by whether the sampling 
method for handling imbalance paired with the validation set approach or 10-fold 
CV. Finally, we do not report the results using ROSE given its poor performance 
in the simulation. 


3.1 Results 


As seen in Table [1] almost uniformly, the AUC and AUPRC values were higher 
when using 10-fold CV as opposed to the validation set approach, highlighting 
again that when comparing results across algorithms the same resampling strat- 
egy should be used. In assessing the AUC, OVER sampling performed slightly 
better than no sampling, while the opposite was true for the AUPRC. In fact, 
no sampling had the highest AUPRC values. In digging deeper to the simulation 
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results, at a sample size of 10,000, there were no statistical differences across 
the sampling methods for the AUPRC, with a similar lack of distinction even 
at smaller sample sizes. This empirical example further highlights that at large 
sample sizes, the use of sampling methods matters less, particularly when using 
the AUPRC. 


Table 1: Results from the Empirical Analysis. 
AUC AUPRC 
None UNDER OVER SMOTE None UNDER OVER SMOTE 
Logistic Regression 
Validation 0.801 0.805 0.805 0.772 0.302 0.289 0.301 0.224 
10-Fold 0.809 0.806 0.810 0.807 0.321 0.299 0.317 0.308 
Random Forest 

Validation 0.761 0.772 0.765 0.764 0.281 0.268 0.245 0.267 
10-Fold 0.768 0.781 0.771 0.774 0.304 0.267 0.260 0.283 


4 Conclusion 


This paper addressed a number of decision points that psychological researchers 
face when analyzing outcomes that exhibit imbalance. These decision points 
are particularly relevant when applying machine learning, as the importance of 
cross-validation and the accurate testing of hyperparameters become increasingly 
important. With this, there were a number of key takeaways: 


— k-fold CV should be preferred to the validation set approach when using 
sampling methods to address class imbalance. 

— The AUC did not demonstrate a bias in the presence of imbalance when 

using as an overall metric of fit (as opposed to examining the ROC curve). 

While OVER, UNDER, and SMOTE sampling approaches demonstrated 

improvements of no sampling, these improvements were extremely small. 

— The use of sampling did not increase the propensity to overfit, even when 
paired with random forests. 

— 'The ROSE method should not be used. 

— Simple models such as logistic regression may outperform complex machine 
learning algorithms in predicting psychological phenomena (i.e., 


c Grimm 2020) 


Additionally, although the use of sampling can improve mean/median esti- 
mates of performance in the presence of imbalance, there were not meaningful 
reductions in variability to the performance estimates. This finding would not 
have been identified by following the standard use of benchmark datasets, and 
is only possible through the use of simulation. 
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While this study was able to answer a number of questions, there are some 
important limitations. The first is that the data was simulated in a relatively 
simple way, following a logistic link with standard normal variables. Therefore, 
there remains uncertainty as to how the methods perform with datasets that ex- 
hibit levels of complexity falling in between our simulated data approach and the 
benchmark datasets commonly used to test the sampling approaches. A second 
limitation is that we only tested the sampling approaches that are easily ap- 
plied using the caret package in R, while prior research has found performance 
improvements in a number of more recently developed approaches, particularly 
generalizations of SMOTE. While R users can write their own functions im- 
plementing the additional varieties of SMOTE to be paired with caret, this is 
unlikely to occur in the majority of psychological applications. 
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Abstract. Data fusion approaches have been adopted to facilitate more 
complex analyses and produce more accurate results. Bayesian Synthe- 
sis is a relatively new approach to data fusion where results from the 
analysis of one dataset are used as prior information for the analysis 
of the next dataset. Datasets of interest are sequentially analyzed until 
a final posterior distribution is created, incorporating information from 
all candidate datasets, rather than simply combining the datasets into 
one large dataset and analyzing them simultaneously. One concern with 
this approach lies in the sequence of datasets being fused. This study 
examines whether the order of datasets matters when the datasets being 
fused each have substantially different sample sizes. The performance of 
Bayesian Synthesis with varied sample sizes is evaluated by examining 
results from simulated data with known population values under a va- 
riety of conditions. Results suggest that the order in which the dataset 
are fused can have a significant impact on the obtained estimates. 


Keywords: Bayesian synthesis - Data fusion - Exchangeability 


1 Introduction 


Researchers in psychology and in the social and behavioral sciences more broadly, 
have recently expressed concerns about a "replication crisis" (Maxwell, Lau, & 


2015). These concerns have driven researchers to explore and develop 


new strategies for analyzing data across multiple studies and summarizing re- 
sults. In an effort to combat the replication crisis, open-source data repositories 


have grown substantially (Bhattacharya & Saha| 2015), and this greater access 


* Early versions of this simulation study were presented at the International Meeting 
of the Psychometric Society and the Annual Meeting of the Florida Educational Re- 
search Association. This paper extends the simulation conditions in the first author's 
dissertation submitted in partial fulfillment of the doctoral degree at Arizona State 
University, Tempe. 
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to data both enables and requires the development of new strategies for explor- 
ing and analyzing this large amount of publicly available data. Data fusion is 
one method that has been shown to enable more complex and more appropriate 
models to be fit to fused ( combined) datasets than just a single dataset 
(Curran & Hussong| [2009} a edi Bayesian Synthesis is a recently 
proposed Bayesian approach to data fusion whereby results from the analysis of 
one dataset are used as prior information in the subsequent analysis of the next 
dataset [2017b), and those results are in turn used 
as prior information for the analysis of yet another dataset. Datasets of interest 
are sequentially analyzed in this manner until a final posterior distribution is 
created, which incorporates information from all datasets of interest. 
Conducting data fusion using the Bayesian Synthesis approach thus relies on 
the sequential updating of estimates as new information from each additional 
dataset becomes available (for complete technical ET on the approach and 
various in empirical applications see for example, 
& Hofer] BOIS] Johnson & Guttmannova] 2019) 20r) Marconlies BorrajporTb[po1s| 
B et bi 2018). This synthesis notion is expressed 


using Bayes theorem as 


P(Data|Unknowns) P(Unknowns) 
P(Data) 
x P(Data|Unknowns)P(Unknowns), 


P(Unknowns|Data) = 


where P( Unknowns) is the prior probability distribution for the unknown pa- 
rameters, P(Data|Unknowns) is the conditional probability of the data given 
the unknown parameters, and P(Unknowns|Data) is the posterior probabil- 
ity distribution for the unknown parameters given our data. Thus when two 
datasets are fused, the prior information about the unknown parameters can 
be considered equivalent to a data set that, when merged with the current 
data, supports the following Bayesian inference P (Unknowns| Data, Dataz) « 
P (Datag| Unknowns) P ( Unknowns| Data,). Here, P(Unknowns|Data;) is the 
posterior distribution that resulted from the first analysis where information 
from Data, was incorporated with P(Unknowns) and then serves as the prior 
distribution for the present analysis that incorporates the data in Datas. When 
k datasets are to be fused, the process can be denoted in a general form as 
P (Unknowns| Data, ..., Datay41) x P(Datay44| Unknowns) P (Unknowns| Data;, 
., Datay) with the priors and posterior distribution similarly updated. 

A major benefit of the Bayesian Synthesis approach over traditional frequen- 
tist approaches to data fusion is the ability to incorporate datasets for which 
raw data is not available. In this manner, the Bayesian Synthesis approach pro- 
vides an alternative to the necessity to analyze the raw data and instead uses 
the estimates and summary statistics from the examined studies to incorpo- 
rate into the prior information. The approach therefore utilizes point summary 
estimates of the posterior distributions instead of the actual full posterior dis- 
tributions as required by a fully Bayesian execution of this Bayesian Synthesis 
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approach. Bayesian Synthesis thereby enables summary information from pub- 
lished (or unpublished) research to be incorporated as prior information (i.e., 
an informative prior) for the analysis of another dataset. The sequential use of 
informative priors that are based on the information in past data provides an 
extra source of information to estimate model parameters and this additional 
information can effectively aid in the accuracy of parameters estimation and in 
the interpretation of results. However, one concern with the Bayesian Synthe- 
sis approach is that it heavily relies on updating the information as new data 
summary statistics become available, therefore the order in which the data are 


sequentially analyzed may have an impact on the results (Marcoulides| |2017b). 


Theoretically, in the Bayesian Synthesis approach this should not be a concern 
due to the conventional Bayesian exchangeability assumption 
[1974], however, Bayesian Synthesis utilizes point summary estimates of the pos- 
terior distributions instead of the full posterior distribution as required for a 
fully Bayesian execution of this approach. While this has the potential to intro- 
duce some bias, using point summary estimates of the posterior distributions 
greatly increases the ease of execution and enables researchers to straightfor- 
wardly implement the Bayesian Synthesis approach in standard programs like 


Mplus (Muthen & Muthen| |2017). The cost of potentially introducing bias may 


outweigh the difficulties of incorporating the full distributions in the sequential 


analysis (Marcoulides] 2017b). 


'To address the concern about the order of the datasets being analyzed, 
examined the exchangeability assumption and found that the 
order of analysis did not meaningfully impact the final data fusion results. Similar 
conclusions regarding exchangeability were also recently suggested by [Miocevic,] 
[Levy, and Savord] (2020). One limitation with these conclusions is that they were 
based on analyzed datasets that were from similarly-sized large samples. There- 
fore, it is still unknown whether the order of datasets matters when the datasets 
being fused each have substantially different sample sizes (as is quite common 
with empirical data). For example, it may be that beginning the Bayesian Syn- 
thesis approach with the analysis of a large dataset produces a substantially 
biased final posterior distribution when the other sequentially analyzed datasets 
are much smaller, or vice versa. 


In this study, we focus on this unexamined scenario in which there are multi- 
ple datasets of both small and large sample sizes. Our main question is whether 
the order in which datasets are incorporated in the Bayesian Synthesis process 
will impact the results when one dataset is substantially larger than the rest. To 
evaluate the performance of Bayesian Synthesis with varied sample sizes, results 
from simulated data with known population values will be examined under a 
variety of design study conditions. We conclude with a discussion of the results, 
implications of the findings, and suggestions for further research. 
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2 Methods 


2.1 Monte Carlo Data Simulation 


In order to systematically evaluate the exchangeability of datasets with varying 
sample sizes in the Bayesian Synthesis approach, simulated data using Monte 
Carlo techniques were analyzed under a variety of longitudinal data design con- 
ditions. All simulated data were generated using R (R Development Core Team 
2010), and analyzed in Mplus (Muthen & Muthen| |2017) through R using the 


MplusAutomation package (Hallquist & Wiley| 2014). The analyses were spec- 
ified to use the Gibbs (PX1) algorithm with a minimum of 50,000 iterations, 
using the Potential Scale Reduction (PSR) convergence criteria of 1.05 
[2014], a median summarized posterior, 250 replications, and two chains 
without thinning. 

'The simulated data were modeled after the|Marcoulides and Grimm] 
study, which analyzed six longitudinal studies measuring students’ mathematics 
ability. T'hese studied six datasets varied in their sample size, timing, and num- 
ber of measurement occasions. As |Marcoulides and Grimm] (2017) showed that 
mathematics ability increased as children got older, we used the linear growth 
model to generate the simulated data. In this model, individuals (n) are mea- 
sured on their math abilities (y) across multiple measurement occasions (t), 
using the data generation model 


t—k 
n = Non + ( k ) Nin + €tn; (1) 
2 


where 5o, is an individual's latent intercept when time t equals zero, si, is 
an individual's latent slope when time ¢ equals zero, kı and kz are functional 
variables to center the intercept and scale the slope, and e;, is the residual at 
time £ for individual n. In this model, we assume the two latent variables follow 
a multivariate normal distribution, [gos, mis] ~ N(B, Y); and the residuals 
are assumed to follow a normal distribution, ei,  N(0, o2), with mean 0 and 
constant variance. 

In all simulation conditions, the latent intercept and slope means were fixed 
at Bintercept= -2 and Slope = 0.4, and the residual variance o2 was fixed at 0.10 
to reflect ae amounts of residuals. [Asparouhov and Muthén] (2010 0) and [Mc-] 
[Neish] (2016| 2016) indicated that, compared to other parameters of ee Tacos growth 
model, the “I ies of prior distributions for the variance-covariance matrix (W) 
in this model is extremely important. Therefore, we additionally varied the 
W matrix to reflect small, medium, and large magnitudes of their variances, 
and zero and small magnitudes of their covariances, resulting in the following 


: : T _ | 0.20 0.0 _ | 0.70 0.05 
three variance-covariance matrices: V, = | 0.0 al P = [ois ral , and 
0.40 0.20 


0.20 0.40 
Table |1| below. Figure |1| provides an illustrative display of data score plots for 


W3 = . A summary of these population parameters is presented in 
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one simulated sample of N = 50 observations based on these three variance- 
covariance matrices for a simulated data design condition comprised of observa- 
tions obtained across 3 assessment occasions taken every 5 years starting at age 
5. 


Table 1: Population matrices and covariance parameters. 
2 


v Bintercept Psiope Te 
bes A -2.00 0.40 0.10 
jooo] aoo oso oo 
[249029] aw oa om 


For each variance-covariance matrix condition considered, six datasets were 
simulated. Because the impact of sample size on exchangeability may also depend 
on how well the particular sample matches the population of interest, the six 
datasets were varied with respect to the number of assessments, years between 
assessments, the age of participants’ first assessment, and the sample size. The 
data patterns for these six datasets are presented in Table|2| and are meant to 
reflect the full growth trajectory (i.e., across the full age range of interest) with 
early and late age ranges, as well as large and small numbers of observations 
across different numbers of assessment occasions. As indicated, the sample sizes 
for these six datasets were also varied across each W matrix condition, such 
that each of these datasets were simulated to have a sample size of 1000 and 
incorporated into the Bayesian synthesis approach as the first dataset, randomly 
varying the order of the remaining 5 datasets each with sample size of 50, and 
then again as the last dataset, randomly varying the order of the preceding 5 
datasets. These different sample sizes of 50 and 1,000 were selected to reflect 
small and large sample studies that are commonly encountered in longitudinal 
data analyses Paxton, Curran, Bolles, Kirby, & Chen] 2001). 

The different specifications for the simulated data presented in detail in Ta- 
ble[2|resulted in a total of 36 different simulated data conditions (3 W matrices, 
6 data patterns, and 2 fusing sequences). For example, for the first W4 matrix 
condition, if dataset 1 (measured 3 times with five years between assessments, 
starting at age 5) was simulated to have 1000 individuals, the other 5 datasets 
were then simulated to have a sample size of 50 observations. Thus, the Bayesian 
Synthesis approach begins with the analysis of dataset 1 with 1,000 observations 
and produces a posterior point estimate that is then used as the informative 
prior for the analysis of say dataset 2 with 50 individuals. This process contin- 
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Figure 1: Illustrative data score plots for N = 50 for the first data design. 
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Table 2: List of the patterns of measurement occasions to be used in the simulated 
data 
Dataset Number of Assessments Years between Assessments StaringAge 


1 3 5 
2 10 1 5 
3 10 0.5 2.5 
4 10 0.5 10 
5 3 0.5 4 
6 3 1 11 


ues accordingly for each of the remaining datasets until a final posterior point 
estimate is produced that incorporates information from all 6 datasets. As noted 
above, this same process is also conducted again but instead with the 1000 ob- 
servations utilized as the last dataset while randomly varying the order of the 
preceding 5 datasets each with 50 observations (see additional details below). 


2.2 The Bayesian Synthesis Approach 


After simulating the datasets according to the conditions described above, we 
conducted data fusion using the Bayesian Synthesis approach. We began this 
sequential data integration process with non-informative priors for the analysis 
of the first dataset because, according to[Asparouhov and Muthén| (2010), these 
initial non-informative priors should not introduce bias, even in small sample 
sizes. The rationale for using non-informative priors also follows recommenda- 
tions provided by {Gelman et al.| to “...let the data speak for themselves, 
so that inferences are unaffected by information external to the current data (pg. 
51)". This is because “the information about model parameters contae in the 
data will far outweigh any reasonable prior probability specification” 
[et al] [2014). Similar recommendations concerning the use of non- Am 
priors for estimation in growth curve analyses were also provided by [Liu, Zhang] 
[and Grimm] (2016). So, unless dependable prior information about the range 
of possible values that model parameters might take is available, beginning the 
sequential data integration process with non-informative priors appears to be 
preferable to simply taking a guess at the values for the priors and using infor- 
mative inaccurate priors, which are known to yield less accurate estimates than 
non-informative priors (e.g., [2018]. 

'Thus, the Bayesian Synthesis approach begins by using non-informative pri- 
ors as parameters for the first data set. This implies that for the intercept and 
slope means a Normal prior of the form N(mean, variance) is used with N(0, 
1019), which is the default for a non-informative Normal prior in Mplus 
[e Muthen| 2002). 'Then, for the parameters in the V matrix the Inverse Wishart 
prior IW(0,-3) is used, which is also the default non-informative prior in Mplus 


(Muthen Muthen & Muthen Muthen| [2002] Muthen & Muthen| 2002). This prior is of the general form IW(S, d), where 
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d(T intercept) d(ors) 

d(ors) d(oz1,,) ? 
with the estimated intercept (o fntercept) and slope (02,,,,) variances and their co- 
variance (cra). Finally, the Inverse Gamma IG(-1, 0) for the residual variance, 
which is also the default non-informative prior in Mplus 
2017). The Inverse Gamma is of the general form IG(a, 8), in which a = vo/2 
and 8 = voa?/2, and where o2 can be interpreted as the best estimate of the 
variance and vp can be interpreted as a pseudo-sample size. Upon analyzing the 
first data set based on the non-informative priors, posterior point summary es- 
timates for the 8, WV, and o2 parameters are then sequentially substituted into 
the respective priors for the next data analysis and the pseudo-sample size of 
the current data set is then changed by the sample size of the previous data set. 
'This process then makes the priors in the subsequent analyses informative priors 
and continues sequentially until the sixth and final data set is analyzed and the 
final posterior distribution and point estimates produced. The simulation results 
for these analyses are presented in Tables [3a] through [Ba] 

'To further investigate the extent to which the use of non-informative versus 
informative priors at the outset of Bayesian Synthesis might also induce some sort 
of bias into the obtained results, we also performed all sequential data integration 
processes using informative priors for the analysis of the first dataset. Given that 
in practice it may be unlikely for a researcher to have exactly accurate prior 
information regarding the parameter values, we followed the recommendations 
of[Depaoli| (2014) and|Finch and Miller] (2019) to use “informative” priors as those 
in which the specified priors correspond to the estimated maximum likelihood 
growth parameters for each model. Using these estimated values as informative 
priors, the analyses were then repeated using the 36 different specified simulation 
data conditions (3 V matrices, 6 data patterns, and 2 fusing sequences) and are 


presented in Table |3b|to Table 


d is the pseudo-sample size and S is the scale matrix 


2.3 Parameter Evaluation 


The final posterior point summary estimates obtained for each of the 36 different 
simulated data conditions were evaluated in terms of their raw bias (B (ĝ)), rel- 
ative bias (RB (g)), accuracy (RMSE (ĝ)), and efficiency (Efficiency(9). These 
criteria were selected based on past research on dependable parameter evaluation 


benchmarks (Bandalos & Gagne} 2012| |Bandalos & Leite| 2013). These criteria 


were computed based on the following formulas: 
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R iĝ 2 
RMSE (g) = See W" (4) 


XE $--8) 


Efficiency (§) = R-1 


(5) 
where R represents the total number of simulation replications, which is 250 in 
our study; j is the estimated parameter; y is the known population value for our 
simulation; and ĝ is the average parameter estimate. 

Raw bias, accuracy, and efficiency in essence evaluate the average deviation, 
the square root of the average deviation, and the variability of the final poste- 
rior distribution means, respectively. Nonzero positive or negative values of raw 
bias indicates overestimation or underestimation respectively. Lower values of 
accuracy correspond to more precise estimates of the parameters, or estimates 
of parameters that exhibit a smaller range of error [2012]. 
Values closer to zero correspond to more efficient estimates of the parameters. 
In other words, smaller values correspond to a smaller range of variability, or 
higher consistency of estimation. In contrast to these criteria, the magnitude of 
relative bias is expressed on the percentage scale and indicates the percent devi- 
ation of the estimate from the population parameter. This measure is ideal for 
comparisons of the magnitude of bias across different design conditions 
[2002]. To evaluate relative bias, it has been suggested that values 
less than 5% reflect ignorable bias, values between 5% to 10% indicate moder- 
ate bias, and values larger than 10% are considered substantial bias 
[2002]. Because multiple simulation replications were conducted (in 
this case 250 replications were analyzed for all linear growth models examined), 
average values of relative bias for each evaluated parameter are reported across 
all the replications. In general, the values of raw bias, accuracy, and efficiency 
are typically much harder to unravel, whereas values of relative bias are much 
easier to interpret; we therefore pay extra attention to disentangling obtained 
relative bias values in reviewing the crucial findings in our study. 


3 Results 


The simulation results for examining the performance of the exchangeability 
principle applied within the Bayesian Synthesis approach are presented in Ta- 
bles [3a] through Within each table, the results are organized based on the 
magnitude of the variance-covariance V matrix (in order of small, medium, and 
large magnitudes) and the order of data fusion (i.e., with the large dataset fused 
first versus last). The computed parameter estimates reported include the latent 
variable means f rercep: ANd slope, their variances and covariances in terms of 
intercept? C Slope C18; and the residual variance 92. Each table presents findings 
based on the designated evaluation criteria (raw bias, relative bias, accuracy, 
and efficiency) for each of the parameter estimates across the examined data 
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design conditions with both non-informative and informative priors. Except for 
estimates of Bintercept, positive values of the criteria indicate positive bias or 
overestimation and negative values indicate negative bias or underestimation. 
Because intercept Was fixed in the simulation conditions at a negative value, 
obtaining a negative bias actually corresponds to overestimation and a positive 
bias corresponds to underestimation. 

The results for the first data design condition with specified covariance matri- 
0.20 0.0 | p= k a ande [es 0.20 
0.00.01] ?  |0.050.10]" ? — [0.20 0.40 
in Table The first data design condition comprised observations obtained 
across 3 assessment occasions taken every 5 years starting at age 5. This dataset 
reflected the feature of large breadth with small numbers of assessments while 
covering different age ranges. The reported results correspond to those obtained 
when (i) a sample size of 1000 observations is incorporated as the first dataset 
into the Bayesian synthesis approach while randomly varying the order of the 
remaining 5 datasets each with a sample size of 50, and (ii) a sample size of 1000 
observations is incorporated into the Bayesian synthesis approach as the last 
dataset while randomly varying the order of the preceding 5 datasets each with 
a sample of size 50. To make this distinction clear, the results are labelled in 
this and all subsequent tables as FIRST and LAST for each reported evaluation 
criterion. 

In general, the obtained values for the raw bias, accuracy, and efficiency 
criteria are larger when the larger dataset is fused last instead of first in the 
Bayesian synthesis approach. However, these values are all still very close to zero, 
indicating precise and efficient estimates of the parameters. These obtained bias 
values also appear to be similar regardless of which specified variance-covariance 
matrix is examined. Looking for example at the estimated intercept variance 
(Cintercept) for the first dataset condition in Table 3a| some patterns of results 
can be detected. For instance, the obtained values for the raw bias (.0011, .0048, 
and .0029), accuracy (.0133, .0331, and .0212), and efficiency (.0133, .0327, .0210) 
when the large dataset was fused first are negligible. However, when the large 
dataset is fused last, the estimated intercept variance values increase (though are 
still rather close to zero) for the raw bias (.1244, .1300, and .1286), the accuracy 
(.2664, .2828, and .2776), and efficiency (.2355, .2511, .2459). Similar patterns of 
results are observed when informative priors are used to analyze the large data 
set first (see Table 3b). As indicated previously, because the values of raw bias, 
accuracy, and efficiency are typically more challenging to unravel, we instead 
pay extra attention to disentangling the obtained relative bias values as these 
are generally easier to interpret. 

When examining the relative bias criterion under the three variance-covariance 
matrixes, some interesting patterns of results are again revealed. Specifically, 
with initial non-informative priors it can be seen that the relative bias for the 
estimated intercept variance (07,,,,,,,;) increases from ignorable sizes (0.546%, 
0.692%, and 0.740%, respectively) when the large dataset was fused first into 
substantially biased values (62.208%, 18.572%, and 32.161%, respectively) when 
the large dataset was fused last. The estimates of the slope variance (CStope) also 


ces Wy = | are presented 
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showed somewhat similar patterns of results, but different magnitudes when fus- 
ing the larger sample dataset last. The variances changed from ignorable biased 
(0.72096, 0.88096, and 0.80296, respectively) when the large dataset was fused 
first, to substantially biased values (58.1296) for the small magnitude covariance 
matrix (V1) and moderately biased (8.31%, 6.336%) for medium and large mag- 
nitude covariance matrixes (V5 and V3) when the larger dataset was fused last. 
Another sizable amount of relative bias was also observed when examining the 
magnitude of the intercept slope covariance value (ors) for the V? variance- 
covariance matrix, shifting from ignorable bias (0.384%) when the large dataset 
was fused first to substantially biased (-12.984%) when the larger dataset was 
fused last. 


In contrast, when initial informative priors are used, the relative bias for the 
residual variance when the large sample is incorporated into the Bayesian syn- 
thesis approach as the last dataset were found to be moderate across the three 
covariance matrix conditions (7.248 96, 5.24496, and 5.10896). Additionally, the 
relative bias for the estimated intercept (07,,,,,,,,;) and slope (o2,,,,) variances 
increased from ignorable sizes when the large dataset was fused first into sub- 
stantially biased for the first covariance matrix condition (12.01% and 16.52% 
respectively) when the large dataset was fused last. 


The observed relative bias for the variance of the intercept, the slope, and 
to a lesser degree the intercept-slope covariance in this first data design condi- 
tion highlight the importance that the order of fusing the datasets can play in 
the estimation of parameters when implementing Bayesian Synthesis strategies. 
Interestingly, this finding is not in line with past research that has suggested 
that the order of data fusion does not meaningfully impact the final posterior 
distribution results 2020). When the data 
sets being fused are of differing sizes (50 vs. 1,000), ending with the fusion 
and analysis of a large dataset can in fact produce a substantially biased final 
posterior distribution when the other sequentially analyzed datasets are much 
smaller. However, these results are only discernable when using the measure of 
relative bias, they do not appear sizeable when examining the values of raw bias, 
accuracy, or efficiency. 


The results for the second simulated data design condition for the variance- 
covariance matrices W1, W2, and V, are presented in Table The second 
data design condition comprised of observations across 10 assessment occasions, 
measured every year, starting from age 5. This simulated condition reflected the 
feature of large breadth of measurement years covering different age ranges. As 
described previously, the reported results correspond to those obtained when a 
sample size of 1000 observations is incorporated as the first dataset and again as 
the last dataset while randomly varying the order of the other 5 datasets each 
with 50 observations. The results are similarly labelled as FIRST and LAST for 
each evaluation criterion examined. 

In general, the obtained values for the raw bias, accuracy, and efficiency 
criteria reflect similar results to those observed for the first simulated data design 
condition. Looking for example at the estimated intercept variance (T ntercept) 
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Table 3a: Data Condition 1 Using Initial Non-Informative Priors — Parameter 
Evaluation Criteria Results 


Parameter Raw Bias Relative Bias Accuracy Efficiency 
FIRST LAST FIRST LAST FIRST LAST FIRST LAST 
0.0005? -0.0009 0.4840? -0.9440 0.0033? 0.0055 0.0033? 0.0054 
o2 0.0006” -0.0002 0.5600” -0.2200 0.0033^ 0.0042 0.0033” 0.0042 
0.0005* -0.0003 0.5000* -0.2880 0.0034? 0.0053 0.0033* 0.0053 


0.0001 -0.0154  —— —— 0.0019 0.0323 0.0019 0.0284 
CIS 0.0002 0.0065 0.3840 -12.984 0.0083 0.0389 0.0083 0.0384 
0.0000 0.0040 0.3520 0.1960 0.0140 0.0614 0.0140 0.0614 


0.0007 0.0016 -0.0334 -0.0776 0.0163 0.0172 0.0163 0.0171 
Bintercept 0.0011 0.0020 -0.0570 -0.0986 0.0267 0.0267 0.0267 0.0266 
0.0007 0.0015 -0.0354 -0.0740 0.0212 0.0209 0.0212 0.0208 


-0.0001 -0.0001 -0.0220 -0.0130 0.0034 0.0034 0.0034 0.0034 
Bstope 0.0001 0.0002 0.0130 0.0450 0.0091 0.0092 0.0091 0.0092 
0.0006 0.0008 0.1430 0.2000 0.0181 0.1790 0.0181 0.0179 


0.0011 0.1244 0.5460 62.208 0.0133 0.2664 0.0133 0.2355 
OŽatercept 0.0048 0.1300 0.6920 18.573 0.0331 0.2828 0.0327 0.2511 
0.0029 0.1286 0.7370 32.161 0.0212 0.2776 0.0210 0.2459 


0.0001 0.0058 0.7200 58.120 0.0006 0.0107 0.0006 0.0090 
C Slope 0.0009 0.0083 0.8800 8.312 0.0043 0.0189 0.0042 0.0169 
0.0032 0.0253 0.8020 6.336 0.0168 0.0589 0.0165 0.0531 


Note: ® Denotes results for covariance matrix Y1, P denotes results for co- 
variance matrix VU», and © denotes results for covariance matrix V5. The 
relative bias for estimates of the intercept-slope covariance for V; cannot 
be computed, as the population value was zero. For Bintercept, negative bias 
corresponds to overestimation and positive bias corresponds to underesti- 
mation. For all other parameters values negative bias corresponds to under- 
estimation and positive bias corresponds to overestimation. Bolded values 
indicate moderate or substantial bias. 
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Table 3b: Data Condition 1 Using Initial Informative Priors - Parameter Eval- 
uation Criteria Results 
Parameter Raw Bias Relative Bias Accuracy Efficiency 
FIRST LAST FIRST LAST FIRST LAST FIRST LAST 
0.0003? 0.0072 0.3280? 7.2480 0.0031? 0.0169 0.0031? 0.0152 
ae 0.0004^ 0.0052 0.3640 5.2440 0.0032P 0.0220 0.0031^ 0.0214 
0.0003* 0.0051 0.3240* 5.1080 0.0031° 0.0172 0.0031* 0.0164 


-0.0001 -0.0050 —— mE 0.0020 0.0233 0.0020 0.0228 
CIS -0.0003 -0.0025 -0.6880 -4.9920 0.0085 0.0380 0.0085 0.0379 
-0.0006 -0.0098 -0.2820 -4.9060 0.0148 0.0615 0.0148 0.0607 


0.0000 0.0004 -0.0016 -0.0206 0.0156 0.0166 0.0156 0.0166 
Bintercept -0.0003 0.0009 0.0130 -0.0468 0.0254 0.0254 0.0254 0.0254 
-0.0004 0.0011 0.0200 -0.0528 0.0202 0.0217 0.0202 0.0217 


-0.0002 0.0000 -0.0420 0.0010 0.0033 0.0036 0.0033 0.0036 
Bstope -0.0002 0.0003 -0.0490 0.0700 0.0093 0.0094 0.0093 0.0094 
-0.0004 0.0008 -0.0940 0.1930 0.0185 0.0196 0.0185 0.0196 


0.0007 0.0240 0.3260 12.0100 0.0136 0.1740 0.0136 0.1723 
O intercept 0.0014 -0.0103 0.1994 -1.4657 0.0337 0.2292 0.0337 0.2290 
0.0015 0.0128 0.3690 3.1940 0.0216 0.1890 0.0216 0.1886 


0.0000 0.0017 -0.3200 16.520 0.0006 0.0066 0.0006 0.0064 
Cope 0.0002 0.0006 0.1920 0.6080 0.0042 0.0153 0.0042 0.0152 
0.0004 0.0009 0.0960 0.2210 0.0159 0.0416 0.0159 0.0416 


Note: Same as Table 3a. 
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for the first dataset condition in Table the same patterns of results can be 
detected. For instance, the obtained values for the raw bias (.0064, .0219, and 
.0127), accuracy (.0196, .0685, and .0396), and efficiency (.0186, .0649, .0375) 
when the large dataset was fused first are negligible. However, when the large 
dataset is fused last, the estimated intercept variance values increase (though 
are still rather close to zero) for the raw bias (.0688, .0756, and .0721), accuracy 
(1444, .1521, and .1474), and efficiency (.1268, .1319, .1285). 


Focusing on the measure of relative bias with initial non-informative priors, 
the biased values are observed when comparing estimates to the true popula- 
tion values when the larger dataset was fused last in the Bayesian synthesis 
approach and regardless of the specified variance-covariance matrix examined. 
Specifically, one can see that the relative bias for the estimated intercept vari- 
ance (OF ntercept) increases from ignorable sizes (3.19296, 3.12996, and 3.18596, 
respectively) when the large dataset was fused first into substantially biased 
values (34.414%, 10.804%, and 18.013906, respectively) when the large dataset is 
fused last. The estimates of the slope variance (TStope) also showed somewhat 
similar patterns of results, but with somewhat different magnitudes when fus- 
ing the larger sample dataset last. The variances changed from ignorable bias 
(1.52%, 1.56%, and 1.89%, respectively) when the large dataset was fused first, 
to substantially biased values (33.92%) for the small magnitude covariance ma- 
trix (V1) to moderately biased (5.86%) for medium magnitude covariance matrix 
(V2) to ignorable (4.53%) for the large magnitude covariance matrix (V3) when 
the larger dataset was fused last. A sizable amount of relative bias was again 
observed when examining the magnitude of the intercept slope covariance value 
(org) for the Wg variance-covariance matrix, shifting from ignorable bias (1.84%) 
when the large dataset was fused first to substantially biased (-10.696%) when 
the larger dataset was fused last. Table [4b] displays the results for when initial 
informative priors are used in the data fusion process. Here, moderate bias is 
only observed for the estimates of the intercept variance p and slope 
variance (TStope) for the first covariance matrix condition (V1) when the large 
dataset is analyzed last (7.706% and 6.96% respectively). 


The findings obtained under the second simulated data design condition once 
again highlight the importance that the order of fusing the datasets plays in 
the estimation of parameters when implementing Bayesian Synthesis strategies. 
Fusing data in which a larger dataset is fused last can produce a substantially 
biased final posterior distribution when the other sequentially analyzed datasets 
are much smaller. 


It is important to note that these same patterns of results were also observed 
for the both the third (see Table and the fourth (see Table |6a) simulated 
data design conditions using both non-informative and informative priors for 
the initial dataset in the data fusion process (see Tables [5b] and [6b). The third 
data design condition encompassed a longitudinal study with 10 assessment oc- 
casions, measured every 6 months, starting from age 2.5. This data design was 
meant to reflect small breadth studies that start at an early age range but with 
large numbers of observations. The fourth data design condition also covered 


Table 4a: Data Condition 1 Using Initial Non-Informative Priors - Parameter 
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Evaluation Criteria Results 


Parameter Raw Bias Relative Bias Accuracy Efficiency 
FIRST LAST FIRST LAST FIRST LAST FIRST LAST 
0.0002^ 0.0000 -0.208? 0.0080 0.0051? 0.0016 0.0051? 0.0016 
o2 0.0002” 0.0001 -0.232 0.1360 0.0051" 0.0016 0.0051^ 0.0016 
0.0002° 0.0001 -0.236° 0.1160 0.0051° 0.0016 0.0051° 0.0016 
0.0001 -0.0093  — i 0.0029 0.0196 0.0029 0.0173 
ors 0.0009 -0.0053 1.8400 -10.696 0.0172 0.0274 0.0171 0.0269 
0.0045 -0.0012 2.274 -0.592 0.0297 0.0416 0.0293 0.0416 
-0.0019 -0.0010 0.0940 0.0506 0.0151 0.0152 0.0150 0.0152 
Bintercept -0.0034 -0.0024 0.1694 0.1224 0.0260 0.0256 0.0257 0.0255 
-0.0024 -0.0019 0.1194 0.0930 0.0206 0.0202 0.0204 0.0201 
0.0001 0.0001 0.0250 0.0350 0.0031 0.0032 0.0031 0.0032 
Bstope 0.0000 0.0003 0.0006 0.0750 0.0089 0.0089 0.0089 0.0089 
0.0004 0.0002 -0.1010 -0.0540 0.0177 0.0171 0.0177 0.0171 
0.0064 0.0688 3.1920 34.414 0.0196 0.1444 0.0186 0.1268 
Tintercept 0.0219 0.0756 3.1286 10.804 0.0685 0.1521 0.0649 0.1319 
0.0127 0.0721 3.1850 18.013 0.0396 0.1474 0.0375 0.1285 
0.0002 0.0034 1.5200 33.920 0.0010 0.0060 0.0010 0.0050 
Tope 0.0016 0.0059 1.5600 5.860 0.0090 0.0123 0.0089 0.0108 
0.0076 0.0174 1.8920 4.531 0.0363 0.0383 0.0355 0.0342 


Note: Same as Table 3a. 
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Parameter 


Raw Bias 


Relative Bias 


Accuracy 


Efficiency 


OIS 


Bintercept 


D slope 


2 
O Intercept 


2 
C Slope 


FIRST LAST 


FIRST LAST 


FIRST LAST 


FIRST LAST 


0.0000? 0.0010 
0.0000P 0.0004 
0.0001* 0.0003 


0.0000 -0.0016 
0.0009 0.0021 
0.0040 0.0030 


-0.0001 0.0000 
-0.0005 -0.0003 
-0.0004 0.0001 


-0.0002 0.0000 
-0.0007 -0.0003 
-0.0010 -0.0007 


0.0063 
0.0225 
0.0133 


0.0154 
0.0129 
0.0173 


0.0001 
0.0015 
0.0067 


0.0007 
0.0001 
0.0032 


-0.02407 1.0080 
0.0440" 0.3760 
0.0600° 0.3200 


1.7600 
2.0120 


4.1680 
1.5140 


0.0036 
0.0252 
0.0224 


-0.0006 
0.0144 
-0.0034 


-0.0580 
-0.1690 
-0.2580 


-0.0120 
-0.0710 
-0.1760 


3.1580 
3.2171 
3.3130 


7.7060 
1.8446 
4.3160 


1.1200 
1.4920 
1.6710 


6.9600 
0.0640 
0.7950 


0.0048" 0.0036 
0.0049” 0.0025 
0.0048° 0.0020 


0.0031 
0.0183 
0.0307 


0.0138 
0.0278 
0.0378 


0.0147 
0.0255 
0.0198 


0.0157 
0.0263 
0.0219 


0.0036 
0.0100 
0.0194 


0.0035 
0.0097 
0.0196 


0.0192 
0.0672 
0.0385 


0.0889 
0.1236 
0.1001 


0.0010 
0.0089 
0.0376 


0.0043 
0.0114 
0.0340 


0.0048" 0.0035 
0.0049" 0.0025 
0.0048° 0.0020 


0.0031 
0.0182 
0.0304 


0.0137 
0.0277 
0.0376 


0.0147 
0.0255 
0.0198 


0.0157 
0.0263 
0.0219 


0.0036 
0.0100 
0.0194 


0.0035 
0.0097 
0.0196 


0.0181 
0.0633 
0.0361 


0.0876 
0.1229 
0.0985 


0.0010 
0.0088 
0.0370 


0.0043 
0.0114 
0.0339 


Note: Same as Table 3a. 
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10 assessment occasions, measured every six months, but starting instead from 
age 10. This data design represented the feature of small breadth with a large 
number of observations that covered a late age range. Given the similarity of 
these observed bias findings, we focus next on examining the results for just the 
fifth and sixth simulated data design conditions. 


Table 5a: Data Condition 3 Using Initial Non-Informative Priors — Parameter 
Evaluation Criteria Results 
Parameter Raw Bias Relative Bias Accuracy Efficiency 
FIRST LAST FIRST LAST FIRST LAST FIRST LAST 
0.0003? -0.0001 0.3247 -0.0064 0.0054? 0.0016 0.0054* 0.0016 
c? 0.0003^ 0.0002 0.284^ 0.1520 0.0055" 0.0016 0.0055” 0.0016 
0.0002* 0.0001 0.244° 0.1280 0.0055° 0.0016 0.0055° 0.0016 


-0.0003 -0.0078 = = 0.0031 0.0159 0.0031 0.0138 
Ors 0.0002 -0.0050 0.408 -10.008 0.0177 0.0229 0.0177 0.0223 
0.0037 -0.0018 1.846 -0.8820 0.0300 0.0331 0.0297 0.0331 


-0.0017 -0.0008 0.0862 0.0406 0.0143 0.0143 0.0142 0.0143 
Bintercept -0.0031 -0.0025 0.1526 0.1226 0.0253 0.0253 0.0252 0.0252 
-0.0022 -0.0020 0.1080 0.0992 0.0198 0.0198 0.0197 0.0197 


0.0001 0.0001 0.0200 0.0270 0.0034 0.0036 0.0034 0.0036 
Bstope -0.0001 0.0003 -0.0170 0.0780 0.0091 0.0090 0.0091 0.0090 
-0.0006 -0.0001 -0.1530 -0.0270 0.0178 0.0173 0.0178 0.0173 


0.0071 0.0530 3.5620 26.518 0.0195 0.1078 0.0182 0.0938 
v m PE 0.0212 0.0600 3.0354 8.5703 0.0673 0.1161 0.0639 0.0993 
0.0130 0.0563 3.2560 14.078 0.0390 0.1117 0.0367 0.0964 


0.0003 0.0029 2.8400 28.680 0.0011 0.0050 0.0001 0.0041 
O Slope 0.0029 0.0050 2.8960 5.048 0.0095 0.0102 0.0090 0.0088 
0.0108 0.0145 1.2700 3.628 0.0371 0.0317 0.0355 0.0282 


Note: Same as Table 3a. 


The results for the fifth simulated data design condition for the variance- 
covariance matrices W1, W2, and V4 are presented in Table This simulated 
data design condition comprised of observations taken across 3 assessment occa- 
sions, measured every six months, starting from age 4. This simulated condition 
reflected the feature of early age range of development in a small breadth of 
measurement years. The results presented in T able[7a]again correspond to those 
obtained when a sample size of 1000 observations is incorporated as the first 
dataset and then as the last dataset while randomly varying the order of the 
other 5 datasets each with 50 observations. 
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Table 5b: Data Condition 3 Using Initial Informative Priors - Parameter Eval- 
uation Criteria Results 
Parameter Raw Bias Relative Bias Accuracy Efficiency 
FIRST LAST FIRST LAST FIRST LAST FIRST LAST 
0.0004? 0.0009 0.43607 0.9120 0.0050? 0.0032 0.0050? 0.0031 
o2 0.0004” 0.0003 0.4080" 0.2720 0.0051? 0.0021 0.0050 0.0021 
0.0005° 0.0002 0.4680° 0.2120 0.0049° 0.0018 0.0049° 0.0018 


0.0000 -0.0006 —— —— 0.0032 0.0119 0.0032 0.0119 
CIS 0.0016 0.0043 3.2080 8.6160 0.0186 0.0244 0.0185 0.0240 
0.0063 0.0066 3.1480 3.3060 0.0310 0.0325 0.0304 0.0318 


0.0002 0.0002 -0.0088 -0.0106 0.0140 0.0146 0.0140 0.0146 
Bintercept -0.0002 -0.0006 0.0078 0.0294 0.0253 0.0255 0.0253 0.0255 
0.0001 -0.0006 -0.0040 0.0276 0.0194 0.0205 0.0194 0.0205 


-0.0002 0.0000 -0.0450 -0.0090 0.0037 0.0039 0.0037 0.0039 
Bstope -0.0003 -0.0004 -0.0830 -0.1000 0.0103 0.0097 0.0103 0.0097 
-0.0006 -0.0013 -0.1570 -0.3240 0.0197 0.0192 0.0197 0.0192 


0.0088 0.0126 4.3800 6.3160 0.0208 0.0703 0.0189 0.0692 
Oe eres 0.0298 0.0169 4.2560 2.4177 0.0726 0.1027 0.0662 0.1013 
0.0177 0.0166 4.4180 4.1420 0.0419 0.0815 0.0380 0.0798 


0.0002 0.0006 1.7600 6.0800 0.0011 0.0038 0.0011 0.0037 
Dope 0.0030 0.0014 3.0040 1.3720 0.0096 0.0098 0.0091 0.0097 
0.0129 0.0091 3.2320 2.2650 0.0382 0.0306 0.0360 0.0293 


Note: Same as Table 3a. 
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Table 6a: Data Condition 4 Using Initial Non-Informative Priors - Parameter 
Evaluation Criteria Results 
Parameter Raw Bias Relative Bias Accuracy Efficiency 
FIRST LAST FIRST LAST FIRST LAST FIRST LAST 
0.0004? -0.0002 0.3807 -0.2040 0.0052? 0.0017 0.00527 0.0017 
o2 0.0003 0.0000 0.304P 0.0080 0.0053^ 0.0016 0.0053P 0.0016 
0.0003* -0.0001 0.304° -0.1320 0.0053° 0.0017 0.0053° 0.0017 


-0.0001 -0.0079 z — 0.0035 0.0158 0.0035 0.0137 
CIS 0.0004 -0.0053 0.832 -10.544 0.0179 0.0224 0.0179 0.0218 
0.0048 -0.0022 2.382 -1.1160 0.0291 0.0336 0.0287 0.0335 


-0.0012 -0.0004 0.0596 0.0194 0.0195 0.0191 0.0194 0.0191 
Brntercept -0.0026 -0.0020 0.1318 0.1020 0.0282 0.0285 0.0281 0.0284 
-0.0016 -0.0015 0.0802 0.0728 0.0231 0.0241 0.0230 0.0240 


0.0001 0.0001 0.0130 0.0190 0.0035 0.0034 0.0035 0.0034 
Bstope 9.0000 -0.0003 -0.0050 0.0640 0.0090 0.0090 0.0090 0.0090 
-0.0008 0.0000 -0.1980 0.0040 0.0176 0.0175 0.0175 0.0175 


0.0077 0.0535 3.8260 26.726 0.0225 0.1107 0.0212 0.0969 
ÜTaseredpk 0.0235 0.0605 3.3566 8.6389 0.0677 0.1160 0.0634 0.0989 
0.0144 0.0577 3.6030 14.424 0.0406 0.1138 0.0379 0.0980 


0.0003 0.0024 2.5200 23.760 0.0011 0.0042 0.0011 0.0034 
C Slope 0.0030 0.0044 2.9600 4.384 0.0094 0.0090 0.0089 0.0078 
0.0110 0.0125 2.7420 3.115 0.0355 0.0279 0.0338 0.0250 


Note: Same as Table 3a. 
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Table 6b: Data Condition 4 Using Initial Informative Priors - Parameter Eval- 
uation Criteria Results 
Parameter Raw Bias Relative Bias Accuracy Efficiency 
FIRST LAST FIRST LAST FIRST LAST FIRST LAST 
0.0003? 0.0002 0.30007 0.2440 0.00517 0.0019 0.0051? 0.0019 
o2 0.0004” 0.0002 0.3600" 0.2040 0.0051? 0.0021 0.0051? 0.0021 
0.0004* 0.0001 0.3520° 0.0720 0.0053° 0.0018 0.0052° 0.0018 


0.0001 -0.0014  —— —— 0.0036 0.0113 0.0036 0.0112 
CIS 0.0019 0.0025 3.7200 5.0320 0.0181 0.0231 0.0180 0.0230 
0.0069 0.0038 3.4380 1.9080 0.0302 0.0317 0.0295 0.0315 


-0.0006 0.0001 0.0292 -0.0038 0.0202 0.0213 0.0202 0.0213 
Bintercept -0.0010 -0.0006 0.0518 0.0320 0.0291 0.0293 0.0291 0.0293 
-0.0006 -0.0006 0.0302 0.0280 0.0242 0.0264 0.0242 0.0264 


-0.0001 0.0000 -0.0250 -0.0110 0.0038 0.0041 0.0038 0.0041 
Bstope -0.0003 -0.0002 -0.0740 -0.0550 0.0102 0.0101 0.0102 0.0101 
-0.0009 -0.0005 -0.2340 -0.1270 0.0195 0.0197 0.0195 0.0197 


0.0084 0.0162 4.2240 8.1120 0.0228 0.0695 0.0212 0.0675 
Oe irren 0.0304 0.0178 4.3463 2.5463 0.0706 0.0990 0.0637 0.0974 
0.0178 0.0189 4.4480 4.7240 0.0418 0.0784 0.0378 0.0761 


0.0003 0.0007 3.4000 6.6400 0.0012 0.0032 0.0011 0.0031 
Dope 0.0042 0.0021 4.1560 2.0960 0.0104 0.0092 0.0095 0.0089 
0.0165 0.0099 4.1200 2.4740 0.0399 0.0292 0.0363 0.0275 


Note: Same as Table 3a. 
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In general, the obtained values for the raw bias, accuracy, and efficiency cri- 
teria reflect similar results to those observed in the other simulated data design 
conditions. Biased values were observed when comparing the obtained estimate 
to the true population values when applying the Bayesian Synthesis approach 
and regardless of the specified variance-covariance matrix examined. Focusing 
again on the estimated intercept variance (oĵptercept) for the first dataset condi- 
tion in Table [7a] the obtained values for the raw bias (.0069, .0212, and .0125), 
accuracy (.0190, .0643, and .0368), and efficiency (.0177, .0607, .0346) when the 
large dataset was fused first are negligible. However, when the large dataset is 
fused last, the estimated intercept variance values increase by a little for the raw 
bias (.0475, .0546, and .0532), accuracy (.0929, .1059, and .1027), and efficiency 
(.0798, .0907, .0878). 


When examining the relative bias criterion under the three variance-covariance 
matrixes, some visible biased patterns of results again emerge. In this data de- 
sign, it can again be seen that the relative bias for the estimated intercept 
variance (ae increases from ignorable sizes (3.472%, 3.033%, and 3.116%, 
respectively) when the large dataset was fused first into moderately biased and 
substantially biased values (23.726%, 7.793%, and 13.330%, respectively) when 
the large dataset was fused last. Interestingly, the estimates of the slope variance 
(o Sio») showed rather different patterns of results, with sizable magnitudes of 
relative bias both when fusing the larger sample dataset first and last. In par- 
ticular, the slope variances displayed substantial bias (18.480906) for covariance 
matrix (Yı) and ignorable bias (4.320% and 3.574%) for covariance matrixes 
(V5 and V3) when the large dataset was fused first, compared to substantially 
biased values (18.2896) for the small magnitude covariance matrix (V1) to moder- 
ately biased (6.35%) for medium magnitude covariance matrix (W2) to ignorable 
(4.42%) for the large magnitude covariance matrix (W3) when the larger dataset 
was fused last. A sizable amount of relative bias was also observed when ex- 
amining the magnitude of the intercept slope covariance value (arg ) for the 
W, variance-covariance matrix, shifting from ignorable bias (-2.104%) when the 
large dataset was fused first to substantially biased (-15.112%) when the larger 
dataset was fused last. This similar pattern was also observed in Table |7b] when 
initial informative priors were used. Specifically, moderate bias for the intercept 
variance (07,,,,,,,,) for the first covariance matrix (V1) condition was observed 
both when the large sample dataset was fused first and last. In contrast, moder- 
ate bias for the estimate of the slope variance (oSiope) was only observed when 
the large dataset was fused last (6.385590). 


Table [Ba] presents the results for the final sixth simulated data design condi- 
tion for the variance-covariance matrices V4, V5, and V3. This simulated data 
design condition comprised of observations collected across 3 assessment oc- 
casions, measured every year, starting from age 11. This simulated condition 
reflected the feature of a late age range of development in a small breadth of 
measurement years. The obtained values for the raw bias, accuracy, and effi- 
ciency criteria reflect similar results to those observed in other simulated data 
design conditions. Focusing again on the estimated intercept variance (O ntercept) 
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Table 7a: Data Condition 5 Using Initial Non-Informative Priors — Parameter 
Evaluation Criteria Results 
Parameter Raw Bias Relative Bias Accuracy Efficiency 
FIRST LAST FIRST LAST FIRST LAST FIRST LAST 
0.00017 -0.0005 -0.1127 -0.5160 0.0053° 0.0032 0.00537 0.0032 
o2 0.0001? -0.0005 -0.052^ -0.4640 0.0052^ 0.0035 0.0052” 0.0035 
0.0000* -0.0001 -0.036° -1.0200 0.0052° 0.0040 0.0052° 0.0039 


-0.0023 -0.0073  —— mE 0.0049 0.0142 0.0035 0.0121 
CIS -0.0011 -0.0076 -2.104 -15.112 0.0156 0.0197 0.0156 0.0182 
0.0008 -0.0079 0.382 -3.9400 0.0251 0.0262 0.0251 0.0250 


-0.0008 -0.0006 0.0378 0.0312 0.0164 0.0152 0.0163 0.0152 
Intercept -0.0011 -0.0010 0.0538 0.4840 0.0263 0.0264 0.0263 0.0264 
-0.0015 -0.0010 0.0764 0.0482 0.0208 0.0214 0.0208 0.0213 


0.0009 0.0009 0.0232 0.2320 0.0062 0.0065 0.0062 0.0064 
Bstope 9.0015 -0.0020 0.3810 0.4960 0.0131 0.0132 0.0130 0.0131 
0.0017 0.0028 0.4170 0.7050 0.0207 0.0211 0.0207 0.0209 


0.0069 0.0475 3.4720 23.726 0.0190 0.0929 0.0177 0.0798 
Oinievadpe 0.0212 0.0546 3.0331 7.7931 0.0643 0.1059 0.0607 0.0907 
0.0125 0.0532 3.1160 13.300 0.0368 0.1027 0.0346 0.0878 


0.0018 0.0018 18.480 18.280 0.0029 0.0029 0.0023 0.0023 
C Slope 0.0043 0.0064 4.3200 6.353 0.0100 0.0114 0.0090 0.0095 
0.0143 0.0177 3.5740 4.416 0.0367 0.0344 0.0338 0.0295 


Note: Same as Table 3a. 
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Table 7b: Data Condition 5 Using Initial Informative Priors - Parameter Eval- 
uation Criteria Results 
Parameter Raw Bias Relative Bias Accuracy Efficiency 
FIRST LAST FIRST LAST FIRST LAST FIRST LAST 
0.0008? 0.0003 0.82737 0.3052 0.00597 0.0037 0.0059? 0.0037 
o2 0.0001? -0.0001 0.1360” -0.0920 0.0046^ 0.0035 0.0046P 0.0035 
0.0001* 0.0002 0.0760° 0.2000 0.0046° 0.0043 0.0046° 0.0043 


0.0003 -0.0016  —— mE 0.0058 0.0110 0.0059 0.0109 
CIS -0.0007 -0.0012 -1.4160 -2.3200 0.0161 0.0192 0.0161 0.0191 
0.0020 -0.0009 1.0160 -0.4480 0.0257 0.0293 0.0256 0.0293 


0.0016 0.0018 -0.0795 -0.0878 0.0165 0.0155 0.0165 0.0154 
Bintercept 9.0025 0.0026 -0.1234 -0.1288 0.0267 0.0261 0.0266 0.0259 
0.0020 0.0016 -0.1000 -0.0802 0.0211 0.0217 0.0210 0.0216 


-0.0002 -0.0001 -0.0562 -0.0351 0.0068 0.0074 0.0068 0.0074 
Bstope 9.0001 0.0001 0.0320 0.0310 0.0140 0.0137 0.0140 0.0137 
0.0007 0.0002 0.1870 0.0600 0.0213 0.0220 0.0213 0.0220 


0.0118 0.0127 5.8976 6.3394 0.0235 0.0613 0.0204 0.0601 
Te A 0.0267 0.0140 3.8080 2.0000 0.0632 0.0824 0.0573 0.0812 
0.0161 0.0144 4.0240 3.5950 0.0368 0.0705 0.0331 0.0690 


0.0000 0.0006 0.0402 6.3855 0.0028 0.0022 0.0029 0.0022 
Dope 0.0028 0.0029 2.8080 2.9080 0.0095 0.0099 0.0091 0.0095 
0.0105 0.0091 2.6280 2.2650 0.0361 0.0318 0.0345 0.0305 


Note: Same as Table 3a. 
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in Table [Ba] the same patterns of results are evident. For instance, the obtained 
values for the raw bias (.0038, .0214, and .0251), accuracy (.0552, .0972, and 
.0613), and efficiency (.0399, .0762, .0559) when the large dataset was fused first 
are negligible. Similarly, when the large dataset is fused last, the estimated inter- 
cept variance values again slightly increase (though are still rather close to zero) 
for the raw bias (.0145, .0335, and .0206), accuracy (.0357, .0808, and .0529), 
and efficiency (.0326, .0735, .0487). 

When carefully examining the relative bias criterion under the three variance- 
covariance matrixes, some novel biased results emerge. Specifically, it can be seen 
in Table Ba that the relative bias for the estimated intercept variance (a NE 
ranges from substantial to ignorable to moderately biased (19.020%, 3.054%, 
and 6.286%, respectively) when the large dataset was fused first, and similarly 
(7.226%, 4.785%, and 5.162%, respectively) when the large dataset was fused 
last. The estimates of the slope variance (o Suspe) also showed somewhat similar 
patterns of results, with sizable magnitudes of relative bias both when fusing the 
larger sample dataset first and last. The slope variances also displayed ranges 
from substantial bias to ignorable bias (13.20%, 4.63%, and 4.04%, respectively) 
when the large dataset was fused first, compared to (15.52%, 2.9396, and 1.66%, 
respectively) when the larger dataset was fused last. A sizable amount of relative 
bias was also observed when examining the magnitude of the intercept slope 
covariance value (org ) for the V? variance-covariance matrix, shifting from 
ignorable bias (-2.01696) when the large dataset was fused first to substantially 
biased (-13.76%) when the larger dataset was fused last. Interestingly, this was 
the only data design condition where moderate bias was observed when the large 
dataset was analyzed first and informative priors were used to analyze the first 
dataset. In table[Bb] we see that there was moderate bias (7.147896 and 5.546696 
respectively) for the estimated intercept (07,,,,,,,;) and slope (o$,,,,) variance 
for the first variance-covariance matrix condition (V). 

'These results appear to collectively highlight not only the importance that 
the order that fusing the datasets can play in the estimation of parameters but 
also the potential impact that data design characteristics can exert when imple- 
menting Bayesian synthesis strategies. It appears that in data design settings 
with fewer occasions of measurement covering wide range of ages, there can be 
sizable bias irrespective of whether a larger data set is fused first or last. Addi- 
tionally, even in instances where there are sufficient assessment settings over a 
wider age range, the order of the fusing of the data sets can again play a key 
role. These results would collectively suggest that the order in which datasets 
are incorporated in the Bayesian Synthesis process do in fact impact the results 
when one dataset is substantially larger than the rest. 
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Table 8a: Data Condition 6 Using Initial Non-Informative Priors — Parameter 
Evaluation Criteria Results 
Parameter Raw Bias Relative Bias Accuracy Efficiency 
FIRST LAST FIRST LAST FIRST LAST FIRST LAST 
0.0009? -0.0003 -0.888 -0.3040 0.0051? 0.0032 0.00507 0.0032 
o2 0.0002? -0.0004 -0.024^ -0.3920 0.0051^ 0.0034 0.0051? 0.0034 
0.0004° -0.0002 -0.408° -0.2440 0.0052° 0.0033 0.0052* 0.0033 


-0.0058 -0.0047 | —— —— 0.0083 0.0084 0.0060 0.0069 
CIS -0.0010 -0.0069 -2.016 -13.760 0.0170 0.0168 0.0169 0.0153 
-0.0018 -0.0064 -0.876 -3.2200 0.0266 0.0241 0.0265 0.0232 


-0.0004 -0.0007 -0.0212 0.0352 0.0308 0.0287 0.0308 0.0286 
Bintercept -0.0006 -0.0010 0.0288 0.0490 0.0423 0.0425 0.0423 0.0425 
-0.0014 -0.0050 0.0696 0.0246 0.0342 0.0369 0.0341 0.0369 


0.0006 0.0003 0.1440 0.0670 0.0061 0.0045 0.0060 0.0045 
Bstope 9.0010 -0.0090 0.3520 0.2300 0.0110 0.0105 0.0109 0.0105 
0.0014 0.0019 0.4170 0.4700 0.0189 0.0190 0.0189 0.0189 


0.0038 0.0145 19.020 7.2260 0.0552 0.0357 0.0399 0.0326 
ÜTaseradpt 0.0214 0.0335 3.0543 4.7851 0.0792 0.0808 0.0762 0.0735 
0.0251 0.0206 6.2860 5.1620 0.0613 0.0529 0.0559 0.0487 


0.0013 0.0016 13.200 15.520 0.0019 0.0027 0.0013 0.0022 
Goie 0.0046 0.0029 4.6320 2.936 0.0094 0.0060 0.0082 0.0052 
0.0162 0.0066 4.0440 1.662 0.0346 0.0187 0.0305 0.0175 


Note: Same as Table 3a. 
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uation Criteria Results 


Parameter 


Raw Bias 


Relative Bias 


Accuracy 


Efficiency 


OIS 


Bintercept 


Bstope 


2 
O Intercept 


2 
C Slope 


FIRST LAST 


FIRST LAST 


FIRST LAST 


FIRST LAST 


0.0006? -0.0002 
0.0003” -0.0005 
0.0007* -0.0004 


-0.0012 -0.0005 
0.0007 -0.0017 
0.0041 0.0010 


0.0031 
0.0034 
0.0027 


0.0022 
0.0025 
0.0001 


-0.0002 
0.0003 
0.0008 


0.0000 
0.0002 
0.0002 


0.0143 
0.0151 
0.0069 


0.0042 
0.0188 
0.0114 


0.0006 
0.0030 
0.0112 


0.0002 
0.0012 
0.0035 


0.6032? -0.1960 
0.2600P -0.4920 
0.6960° 0.3800 


1.3200 
2.0560 


-3.3200 
0.5020 


-0.1547 -0.1094 
-0.1680 -0.1234 
-0.1326 -0.0050 


0.0070 
0.0420 
0.0610 


-0.0435 
0.0700 
0.2080 


7.1478 2.0960 
2.1503 2.6880 
1.7160 2.8560 


5.5466 2.2000 
3.0160 1.1520 
2.8040 0.8810 


0.0059" 0.0031 
0.0053” 0.0031 
0.0056° 0.0030 


0.0086 
0.0181 
0.0281 


0.0078 
0.0156 
0.0238 


0.0361 
0.0471 
0.0374 


0.0306 
0.0430 
0.0384 


0.0065 
0.0111 
0.0190 


0.0047 
0.0105 
0.0194 


0.0523 
0.0846 
0.0691 


0.0375 
0.0719 
0.0498 


0.0016 
0.0091 
0.0355 


0.0020 
0.0061 
0.0192 


0.0059? 0.0031 
0.0053” 0.0031 
0.0055° 0.0030 


0.0086 
0.0181 
0.0278 


0.0078 
0.0155 
0.0238 


0.0362 
0.0470 
0.0373 


0.0305 
0.0430 
0.0384 


0.0066 
0.0111 
0.0190 


0.0047 
0.0105 
0.0194 


0.0506 
0.0832 
0.0688 


0.0373 
0.0694 
0.0484 


0.0015 
0.0086 
0.0336 


0.0020 
0.0060 
0.0189 


Note: Same as Table 3a. 
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4 Discussion 


Bayesian estimation operates by using prior information about the characteris- 
tics of parameters and the conditional likelihood of the data given the model 
parameters to arrive at a posterior distribution. The Bayesian Synthesis ap- 
proach is based on this Bayesian estimation framework in which information 
obtained from one dataset serves to provide prior information for the analysis of 
the next dataset and this process continues sequentially until a single posterior 
distribution is created using all available datasets. While the benefits of using 
fused datasets have been repeatedly demonstrated in the literature (e.g., 
2017), and the estimates computed via a sequentially ob- 


tained final posterior distribution like those in the Bayesian Synthesis approach 
have also been shown to effectively aid in the accuracy of the estimation process 
[2020] |Marcoulides| 20170), what had not be determined was whether 
the order in which the data are sequentially analyzed has an impact on the 
obtained results. 

'The commonly accepted view in Bayesian estimation is that the order in 
which the data are analyzed should not be a concern due to the exchangeability 
assumption [1974]. Nevertheless, because Bayesian Synthesis 
utilizes point summary estimates of the posterior distributions instead of the 
full posterior distribution as required in standard Bayesian estimation, it pos- 
sible that using point summary estimates of the posterior distributions may 
conceivably introduce some bias in the parameter estimates. Although past re- 
search has confirmed that the order of analysis does not meaningfully impact the 
final data fusion results obtained via Bayesian Synthesis 20175), 
these conclusions were determined on the basis of analyzing datasets that were 
from similarly-sized and large samples. What was unresolved in the literature 
is whether exchangeability matters when the datasets being fused have sub- 
stantially different sample sizes, as regularly occurs in empirical settings. Does 
beginning or ending the Bayesian Synthesis approach with the analysis of a large 
dataset produce a biased final posterior distribution when the other sequentially 
analyzed datasets are much smaller? This study examined via simulation the im- 
pact that the ordering of datasets might have on parameter estimates obtained 
when making use of the Bayesian Synthesis process in such data fusion design 
settings. 

'The results of the simulation study collectively highlighted the importance 
that the ordering of datasets can have on the estimation of growth model pa- 
rameters when using the Bayesian Synthesis process. When the datasets being 
fused are of markedly different and much smaller sizes, ending the fusion and 
Bayesian estimation based on a large dataset produces a substantially biased 
(according to the measure of relative bias) final posterior distribution, partic- 
ularly for the intercept and slope variance. Dissimilar longitudinal data design 
characteristics were also sometimes found to produce substantially biased final 
posterior distribution when implementing Bayesian Synthesis strategies. In lon- 
gitudinal data design settings with fewer occasions of measurement and covering 
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varying ranges of ages, sizeable biased estimates were observed irrespective of 
whether a larger data set is fused first or last. Even in instances where there were 
numerous assessment occasions over a wider age range, the order of the fusing 
of the data sets still played a key role in the estimation, primarily with more siz- 
able bias present when the large dataset was fused last. T'he results revealed that 
the order datasets of differing size are incorporated into the Bayesian Synthesis 
process along with the data design characteristics can impact the resulting pa- 
rameter estimates and clearly calls into question the previously accepted notion 
of exchangeability of Bayesian estimation within the Bayesian Synthesis process. 


Researchers planning on using the Bayesian Synthesis approach to data fu- 
sion should therefore be very careful how they elect to begin and end planned 
data fusion activities, especially in instances that involve the analyses of sub- 
stantially large datasets among other much smaller datasets. Because Bayesian 
Synthesis uses point summary estimates from the analysis of one dataset as 
priors for the analysis of the next dataset, it is likely that this can introduce 
some bias in the Bayesian estimation. One explanation for this bias is that when 
the small datasets are being incorporated first, the informative priors that re- 
sult from these small datasets are less reliable (contain sizable bias) and that 
this bias is then inevitably carried over to subsequent samples, resulting in a 
more biased final posterior distribution. Although it is commonly accepted that 
sample size can play an important role in the estimation of parameters, it is 
unclear in this context how much smaller the datasets can be relative to the 
other datasets. In this study, it was unmistakably determined that fusing a large 
dataset with smaller ones biased many of the parameter estimates provided by 
Bayesian Synthesis (particularly when measured by the relative bias criterion). 
But it is unclear what the ideal sample size needs to be in order to be used in the 
approach and ensure sufficiently stable parameter estimates. The current study 
fixed some of the data design characteristics in order to keep the scope of the 
work manageable. Given that a major benefit of Bayesian Synthesis is that data 
from multiple sources can be analyzed to obtain estimates of overall effects, ex- 
amining other data size conditions under which this approach does not operate 
well is a natural extension to the current study. There is overall agreement among 
researchers that larger samples provide more stable estimates, but must all the 
fused datasets meet this requirement in order for Bayesian estimation exchange- 
ability to hold? Although the current results indicated that exchangeability did 
not always hold in the examined data design scenarios involving growth curve 
models where there were differences in the size of the samples, the number of 
measurement occasions, time of first assessment and between assessments, as 
well as magnitude of the intercept and slope variances and covariances, it is of 
course possible that when modeling other statistical paradigms that different 
results may be observed. Empirical applications of the Bayesian Synthesis ap- 
proach must make certain that the data fusion activities will provide researchers 
with unbiased parameter estimates. 


Without doubt prior specification may be the largest advantage, yet poten- 
tially the greatest drawback, of implementing Bayesian methods in Bayesian 
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Synthesis. Priors enable researchers to include information from different data 
sources in a systematic manner. While it is recognized that imposing informative 
priors improves parameter estimates, especially with small sample sizes (Depaoli 


2014} |Little||2006), given that the true prior distribution is unknown in practice, 


researchers must be cautious about the impact that inaccurate priors have on 
parameter estimation in Bayesian Synthesis 2018). We also exam- 
ined the impact of the order of incorporation in the Bayesian Synthesis process 
using initial informative data-dependent priors to analyze the first dataset in the 
data fusion process. Across the various design conditions, moderate and substan- 
tial bias was primarily found when the large dataset was analyzed last. These 
results are consistent with those found when using initial non-informative or dif- 
fuse priors to analyze the first dataset in the data fusion process. Future research 
studies should therefore expand further on our findings and examine additional 
data design conditions and settings. 

'The process of sequentially updating information to arrive at conclusions un- 
deniably has a substantiated place in data analyses and Bayesian Synthesis can 
play a key role in helping researchers address questions not always achievable 
with a single study. Although additional research needs to be done regarding 
when Bayesian Synthesis is most useful and when it might prove to be prob- 
lematic, the foundations for the continued use of this data fusion process are 
evident. We caution researchers to remain mindful of the limitations identified 
in this study when integrating data from different sources. 
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Abstract. Algorithms play an increasingly important role in public pol- 
icy decision-making. Despite this consequential role, little effort has been 
made to evaluate the extent to which people trust algorithms in decision- 
making, much less the personality characteristics associated with higher 
levels of trust. Such evaluations inform the widespread adoption and effi- 
cacy of algorithms in public policy decision-making. We explore the role 
of major personality inventories — need for cognition, need to evaluate, 
the “Big 5" — in shaping an individual's trust in public policy algorithms, 
specifically dealing with criminal justice sentencing. To explore person- 
ality in this context, we fielded an original survey experiment aimed at 
assessing the impact of varying advice sources on forecasting criminal re- 
cidivism, conditioned by personality traits. We found strong correlations 
between all personality types and general levels of trust in automation, 
as expected. Further, we uncovered evidence that need for cognition in- 
creases the weight given to advice from an algorithm relative to humans, 
and “agreeableness” decreases the distance between respondents’ expec- 
tations and advice from a judge, relative to advice from a crowd. 


Keywords: Personality - Trust in automation - Public policy - Decision- 
making 


1 Introduction 


Algorithms are increasingly important in public policy implementation 
2022). Algorithms assist officials in major US cities to al- 
locate resources (O'Brien| [2015], judges in detecting gerrymandering 
(2017), and the military to control weapons [2018]. Recently, 


algorithms have also begun to play a role in criminal sentencing, where algo- 
rithms are used by judges to inform expectations on a defendant's probability 


of recidivating (Waggoner & Macmillen| |2021). Such a hybrid-decision making 


process between humans and algorithms influences the parameters, duration, 


and severity of sentencing (Dressel & Farid} 2018). 
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Despite the rise in interest about automation and algorithms, little atten- 
tion has been paid in public policy to algorithms or the psychological factors 
that influence trust in them. explored situations under which 
people approve development of autonomous weapons systems, but this reveals 
little about the underlying trust people have in algorithms in practice. Further, 
some have debated whether individuals place low levels of trust in algorithms, 
"algorithm aversion" (Dietvorst, Simmons, & Massey [2015], or high levels of 
trust, “algorithm bias” (Logg| 2016). Still, very little attention has been paid to 
how individuals’ psychological characteristics might influence attitudes towards 
algorithms. Instead, the literature tends to focus on demographic or cultural 

'To address this gap, we recently fielded a criminal sentencing survey experi- 
ment and leveraged three major inventories of psychological measures of person- 
ality to explore who is more or less trusting of algorithms: “need for cognition” 


(NC) (Cacioppo & Petty] [1982), “need to evaluate” (NE) (Bizer et al.| [2004], 
and the “Big 5” (Norman| [1963). 


'The survey experiment was primarily interested in assessing the impact of 
varying advice sources (judge, algorithm, a "crowd" of peers) on respondents' 
forecasts of criminal recidivism. Of primary interest was the conditioning role 
of personality in this forecasting effort. The details of and findings from the 
experiment are detailed throughout the remainder of the paper. 


1.1 Personality Inventories 


The first inventory, need for cognition ( see Pet , is A Tum individuals who 
have a strong desire to learn and grow (Cacioppo & Petty! . Some previous 
studies on NC in similar contexts have cacr that Mo ish NC individuals 
are asked to undertake a task in which they are given little information and 
then provided expert advice, they are more likely to assign greater weight to 
that advice, rather than relying on heuristics 
[2004]. This suggests that high NC individuals will be more likely to take 
advice insofar as they view that advice as “expert,” given their more elaborate 


processing of information (Sicilia, Ruiz, & Munuera| 2005). 


Our second, need to evaluate (NE), is associated with individuals who tend to 
generate and retain their own attitudes 2004). This “self-monitoring” 
personality is also associated with the need to control (Snyder| and con- 
stantly evaluate social surroundings 1996). Such attributes in- 
duce greater reliance on intuition over outside sources. Past work has demon- 
strated that high NE individuals tend to make spontaneous judgements in re- 
sponse to stimuli [2001]. This “on-line” form of information 
processing suggests that when people come into contact with outside informa- 
tion, their personality plays a key role in determining their levels of acceptance 
of the information. Resultant attitudes are much stronger than those in the al- 
ternative, “memory-based” processing (Bizer, Tormala, Rucker, & Petty} |2006). 
Other studies have also leveraged NE to explain information processing (Druck-| 


man & Nelson, |2003). For our context, we expect high NE individuals exhibit 
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greater reliance on their own intuition compared to other sources, which should 
lead to distrust. When a high NE individual is confronted by advice from an out- 
side source, we expect these individuals to be less trusting of advice, regardless 
of its origin. This is in line with recent work suggesting that errors experienced 
from an algorithm provoke a stronger distrust of that advice than do errors 


experienced from other sources (Dietvorst et al.||2015). 


For measurement, we used the two-item battery for each personality type (NC 
and NE), totaling four questions for both personality types. Question wording is 
in the Appendix. Of note, though there is a tradeoff between "internal reliabil- 
ity and brevity” [2011], we opted for the 
smaller battery for two reasons. First, it is the exact same approach as using the 
common, reliable TIPI inventory to measure each of the Big 5 traits 
[2003]. Both approaches uses two items per personality 
trait to generate a measure. Second, we wanted to ensure high response rates, 
given the inclusion of the personality batteries in addition to our main experi- 
ment. To minimize the burden on the respondent and with the “the benefit of 
being short enough to be included in large political surveys," 
268), we opted for the smaller battery. Ultimately, we selected these measures 
of personality given their widespread use in a variety of fields including politi- 
cal science (Jost, Glaser, Kruglanski, & Sulloway] 003), public policy (Sargent] 
2004), psychology (Cacioppo, Petty, & Feng Kao] [1984), and others (Luttrell, 
poi). 


Turning now to the Big 5, we use only the “agreeableness” and “openness 
to experience” traits in our study as they can be most clearly linked to trust in 
automation. We selected only two instead of all five traits, because, as 
note, “in most cases only some of the Big Five traits significantly 
predict outcomes of interest” (268). Our approach is similar to other studies on 
the role of the Big 5 in behavior that select only the specific personality traits 


that can be clearly linked to substantive phenomena (Quintelier| 2014). 
For agreeableness, (2011) and |John and Srivastava] (1999) note 


that “agreeableness contrasts a prosocial and communal orientation toward oth- 
ers with antagonism and includes traits such as altruism, tender-mindedness, 
trust, and modesty.” Agreeableness is also associated with social conformity 
and compliance 1981). In our con- 
text, being given advice from an “expert,” and then asked whether they wish 
to update their expectation, we expect agreeable individuals should positively 
respond to the advice-giver, regardless of the source of advice. In an effort to 
conform to the reigning wisdom via the advice treatment, individuals who are 
high on agreeableness should trust automation, positively weight expert advice, 
and also align with the advice-giver. 


Second, openness is associated with originality (Gerber et al. 


1999), intellectual curiosity (Peabody & Goldberg| |1989), and an 
eagerness to learn (Barrick & Mount} 1991). As individuals who are open to 


experiences come into contact with outside advice in an unfamiliar realm, they 
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should positively respond to the advice treatment across all three measures of 
trust discussed below. 

We leveraged the Ten-Item Personality Inventory (TIPI) (Gosling et al. 
to measure these traits. Two items containing personality adjectives are 
associated with each trait, with one phrase coded normally and the other re- 
verse coded (e.g., for “agreeableness”: item 7 = sympathetic, warm and item 2 
(reverse coded) — critical, quarrelsome). 


2 Method 


2.1 Participants 


We utilized Amazon’s Mechanical Turk (MTurk) to recruit 395 subjects, each 
of whom were paid $2.00 for participation. MTurk is a valid, widely used plat- 
form to field similar political, psychological, and social experiments such as ours 


(Clifford, Jewell, & Waggoner| 2015). Additional details of the study design are 


included in the Appendix. 


2.2 Procedure 


Our study contains observational (general trust) and experimental (behavioral 
impact) components. For the observational component, respondents were given 
an eight-item battery of questions related to degrees of trust in automation 
[2008]. Respondents were asked their level of agreement 
on a scale from 1 (strongly disagree) to 7 (strongly agree) for statements like, 
“Using algorithms improves the output quality for organizations." These were 
aggregated into a 7 point scale where 7 indicates high trust in algorithms, while 1 
indicates low trust. The wording for all of the items is available in the supporting 
information. This scale is the dependent variable for the first stage of the analysis, 
which is analyzed using OLS regression and presented in Table [I] 

For the experimental component, respondents were asked to forecast the 
probability of a defendant committing another crime within two years for one 
of eight real, randomly selected criminal profiles based on criminal history and 
defendant demographic characteristics. Then, the respondent was given “advice” 
from a source (listed below), and asked whether they wanted to update predic- 
tions or leave them the same (manual entry required both times). The shifts in 
respondents' predictions (or lack thereof) is the quantity of interest in our study. 
We included two attention checks throughout to minimize satisficing 
[2016]. Specifically, respondents were warned if they missed one atten- 
tion check, and then were removed and not paid if they failed both. About 8096 
of respondents who attempted the survey passed the checks and completed the 
survey. 

'The presentation of our criminal profiles mimics the formatting of 


(2018), which was shown to be a sufficient amount of detail for an av- 


erage M'Turk participant to make an informed judgment, with expected accuracy 
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similar to the popular “COMPAS” algorithm. The full wording is available in the 
supporting information. We randomly selected 20 pre-trial defendants from the 
2013-14 from Broward County, FL database, who all had a risk scores between 
2 and 8 (derived from the COMPAS algorithm, which ranked defendants from 
1 to 10, with 10 being the most likely to recidivate). This pool of defendants 
was winnowed when the crime involved was obscure, and then reduced again 
randomly to reduce the task burden on respondents, which left us with eight 
profiles. 


For each profile, respondents were randomly assigned to one of the three ad- 
vice conditions: judge with 10 years of experience in criminal sentencing; com- 
puter algorithm designed by computer scientists and criminal justice officials; 
average of a previous survey of 300 Turkers. The treatment conditions are coded 
as separate dummy variables for whether the individual saw advice from an algo- 
rithm or a judge in the scenario, with the previous MTurk survey as the baseline 
condition. And, in addition to the main personality predictors, we control for 
several common factors in public policy experiments, including age, education, 
gender, and partisanship. 


We evaluate two measures as trust. The first measure is “weight of advice” 
(Gino & Moore) por [Logg] 2016| . This variable is calculated as | uo; — uii | / | 
i — ui; |, where uo; is eer i's final assigned probability for recidivism, 
uj; is their initial prediction, and a; is the advice they were given from one of 
the sources. A score of 1 suggests the respondent only used the advice from 
the source, where as 0.5 suggests they weighted the source and their prediction 
equally, and 0 means the respondent ignored the advice. Our second measure is 
the average distance to advice, measured as | a;— uri |. Lower values indicate that 
there was less distance between the respondent's final forecast and the advice 

they were given. 


We modeled the weight and distance measures by fitting multilevel regres- 
sions to the data after pooling across all criminal profiles and specifying varying 
intercepts for defendant descriptions and respondent. Multilevel models were 
chosen to account for unobserved heterogeneity on both the individual respon- 
dent and scenario level. This provides an efficient and accurate estimates for 
experiments where respondents evaluate multiple, different scenarios 


& Hilli |2006). The model was specified as 
Yigk = Qijk t Cj + Óx +B * X + eijk (1) 


where aj; is the overall intercept, ¢; ~ N(0, 1) is the random intercept based on 
the defendant description, x ~ N(0,1) is the random intercept based on the 
individual respondent, f is an array of coefficients for the treatments X, and eij 
is the error term. Results are presented in Table [2] 
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3 Results 


For the observational analysis, note the significance and large magnitudes of 
effects for all personality indicators in the top four rows of Table High NE 
respondents are less likely to trust algorithms (f — —0.14), compared to those 
higher on NC (6 = 0.25), agreeableness (8 = 0.04), and openness (8 = 0.07), all 
of whom are eager to learn. The latter group of respondents is more trusting of 
automation in line with expectations. 


Table 1. The Impact of Personality on Trust in Automation 


Dependent variable: 


Trust in Automation 


(1) (2) 
Need for Cognition 0.245*** (0.023) 
Need to Evaluate —0.136*** (0.021) 
Agreeableness 0.040** (0.016) 
Openness to Experience 0.066*** (0.015) 
Age 0.005*** (0.002) — 0.007*** (0.002) 
Education 0.131*** (0.029) —0.020 (0.040) 
Female —0.212*** (0.035) —0.336*** (0.040) 
Partisanship —0.047*** (0.009) —0.028*** (0.010) 
Algorithm Condition  —0.000 (0.00000) —0.000 (0.00000) 
Judge Condition —0.000 (0.00000) 0.000 (0.00000) 
Constant 3.783*** (0.124) — 4.047*** (0.173) 
N 3,022 2,233 
Log Likelihood 32,311.620 24,075.100 
Akaike Inf. Crit. —64,599.240 —48,126.190 
Bayesian Inf. Crit. —64,527.070 —48,057.660 


Note: *p<0.1; **p<0.05; ***p<0.01 


The experimental stage exploring the impact of personality on behavioral 
tasks seen in Table [2] and Figure [I] Here our N is higher because each respon- 
dent evaluated 8 defendant profiles. NC plays a strong conditioning role in the 
relative weight respondents’ assign to advice across both “expert” conditions in 
comparison to the baseline category. The degree to which NC conditions trust 
in algorithms is nearly doubled that of the judge condition (8 = 0.09 com- 
pared to 6 = 0.05). Further, the weight effect is opposite for NE individuals 
in the algorithm condition (8 = —0.04), and indistinguishable from zero in the 


! Of note, the trust in automation index is measured at the individual-level, not the 
scenario-level. Hence the larger N in the tables, relative to the number of individual 
recruited subjects. 
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judge condition. Results are similar for high openness personality types in their 
weighting of algorithmic advice relative to humans. 


Table 2. The Impact of Personality on Behavior 


Dependent variable: 
Weight Distance Weight Distance 
(1) (2) (3) (4) 
Need for Cognition —0.047*** (0.018)  1.687* (0.892) 


Need to Evaluate 0.012 (0.017) —0.964 (0.839) 

Agreeableness 0.018 (0.013) | —0.501 (0.642) 
Openness —0.017 (0.012) —0.126 (0.606) 
Age 0.0002 (0.001) ^ 0.008 (0.043) — 0.001 (0.001) —0.012 (0.051) 
Education 0.010 (0.019) — —1.879** (0.857) 0.010 (0.022) —0.571 (1.008) 
Female 0.015 (0.020) ^ —0.678 (0.930) —0.001 (0.025) 0.758 (1.111) 
Partisanship 0.004 (0.005) ^ —0.187 (0.231) 0.005 (0.006) —0.128 (0.279) 
Algorithm Cond. ^ —0.072 (0.079) —1.701 (4.101) 0.120 (0.097) —9.706* (5.081) 
Judge Cond. 0.006 (0.083) ^ —1.969 (4.307) —0.030 (0.097) — 1.395 (5.144) 
Alg. x NC 0.093*** (0.024) —3.007** (1.251) 

Alg. x NE —0.035* (0.021) — 1.905* (1.106) 

Judge x NC 0.054** (0.022) — —1.929* (1.168) 

Judge x NE —0.033 (0.022) 1.645 (1.140) 

Alg. x Agreeable —0.024 (0.016) 0.610 (0.851) 
Alg. x Openness 0.028* (0.015) — 0.228 (0.801) 
Judge x Agreeable 0.031* (0.016) —2.216** (0.876) 
Judge x Openness —0.005 (0.016) — 1.275 (0.854) 
Constant 0.225** (0.093) 30.664*** (6.182) 0.076 (0.115) 32.606*** (7.002) 
N 3,022 3,022 2.233 2.233 

Log Likelihood —403.901 —12,702.400 —349.032 —9,388.203 
Akaike Inf. Crit. 839.802 25,436.800 730.063 18,808.410 
Bayesian Inf. Crit. 936.021 25,533.020 821.441 18,899.780 


*p«0.1; **p«0.05; ***p«0.01 


Notably, NC strongly conditions trust in algorithms, but less so compared to 
advice from crowds or human experts. This is seen most clearly when comparing 
panels (a) and (b) in Figure[1] The gaps between the fit lines are more distinct 
in the algorithm condition (a) compared to the judge condition (b). There is 
only a modest distinction at the tails in the judge condition. 


Across all treatment conditions, the advice given by all three sources was the 
same, and we found no differences when we presented values centered around 
those derived from the COMPAS algorithm or when the advice was randomly 
chosen. 


Of note, for the weight and distance multilevel models, to test if there was 
any impact of varying the treatments by scenario or by respondent, we ran- 
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domly allocated half of our respondents to each type of assignment. We found 
no difference (i.e., no detection of study purpose) in the results. 


Conditional Effect of NC & Algorithm Condition Conditional Effect of NC & Judge Condition 

0.64 0.64 

0.44 0.44 
o o 
RS] oO 
3 Algorithm 3 Judge 
E Be 3 B 
z EN . E = : 
5 BE S = 
z = 

0.24 0.24 

0.04 0.04 

1 2 3 4 5 1 2 3 4 5 
Need for Cognition Need for Cognition 


Figure 1. Conditional Impacts of NfC on Behavior 


4 Discussion 


Overall, we found that personality influences trust in automation, as well as be- 
havioral tasks related to public policy decision-making. In the first stage, there 
was a pronounced impact of personality on general levels of trust. In line with 
research finding high levels of trust in algorithms 
2011), the significant conditioning role of these personality inventories suggests 
that personalities associated with intellectual curiosity, agreeableness, openness 
to advice-givers, as well as being highly aware of environments and more skepti- 
cal are strongly associated with levels of trust in automation. The former group 
comprised of individuals who are more accepting of new information and expe- 
riences is more trusting, while the latter group, who tends to be threatened by 
exogenous sources of information, is less trusting. 

Regarding changes in respondents’ behavioral indicators of trust, high NC in- 
dividuals are much more trusting of algorithms than of the wisdom of the crowd 
or, to a lesser extent, a human expert. And the NE personality trait, which we 
expected to be threatened by exogenous advice, weighs advice from algorithms 
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less than human advice sources. Surprisingly, no effects were observed for agree- 
able individuals in the algorithmic advice condition for weighting, though there 
were weak effects for judges. Strikingly, high NE and NC individuals reacted to 
algorithmic advice over all other advice sources, though the effect size is nearly 
doubled for NC individuals compared to NE individuals and is more statistically 
stable. We also saw a significant effect of personality conditioning behavior in 
decision making tasks, especially related to their trust in advice from an algo- 
rithm. 

Though a blend of significant and null results, we remain encouraged by 
our findings for two reasons. First, we uncovered strong evidence of personality 
influencing behavior and general levels of trust in automation, in line with our 
main goal. Given the newness of this topic, these results are useful for motivating 
future work on trust in automation and personality. Second, in line with [Gerber] 
[et al.] (2011), it would be unrealistic to expect all personality measures to explain 
all behavior. Of the Big 5 they note, “these traits have predictive power in an 
impressive variety of domains but are not universal predictors of all outcomes” 
(268). Our results corroborate this sentiment that personality plays a role in 
trust in automation, though it does not explain the breadth of general trust and 
behavior. 

Regarding generalizability, while people generally trust algorithmic advice 
relative to other advice sources, levels of trust are influenced by personality 
traits. As not all people retain the same personalities, not all people equally 
trust algorithms to make consequential decisions. 


5 Limitations and Future Directions 


While we offer a starting place for future work on personality and trust in 
automation, a key limitation of our study is focusing only on criminal jus- 
tice. Should we expect similar results in other subfields, such as automation in 
medicine, for example? Also, though [Dietvorst et al. (2015) demonstrate trust in 
algorithms wanes when mistakes are introduced, this phenomenon may be more 
likely for high NE individuals relative to high NC individuals, given the starting 
place of skepticism for high NE individuals. Further, algorithm aversion may not 
be detectable for high NC individuals, while it may drive levels of trust for high 
NE individuals. Or, do the other three Big 5 personality traits (extroversion, 
conscientiousness, and emotional stability) impact trust in automation? In sum, 
we suggest researchers in this realm consider personalities to provide a fuller 
picture of trust in a variety of subfields. 

An additional limitation that may be addressed in future work is the nature 
of MTurk respondents in general, in that they are typically higher educated and 
more liberal for example pors) 
and thus may be more likely to trust automation. Such a possibility suggests the 
potential for future and different samples to yield potentially different results. 
More analysis and experiments in this vein would deepen the impact of our initial 
findings in this research. 
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6 Concluding Remarks 


In this study, we have demonstrated that personality plays a strong role in im- 
pacting individuals! levels of trust in automation as they make public policy 
decisions. We bring psychology into the trust in automation discussion for sev- 
eral reasons. First, such an approach offers a baseline for understanding the role 
of innate, heritable characteristics and their influence on trust in automation] 
Such an understanding makes it clearer where to look for greater or lesser trust 
in algorithms, and where the basis of trust lies. These psychological characteris- 
tics are also widely used in many fields to describe human behavior both inside 
and outside of 
public policy. Given the rapid increase of algorithms and algorithmic advice in 
everyday life [2016], the role of psychological characteristics conditioning 
virtually all human behavior (Eysenck| |1963), and also the recent surge in re- 


search on algorithmic transparency (Rudin & Ustun| 2018], our study offers a 


timely exploration of the intersection of trust in automation and personality. 
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Appendix 
A Support Information 


A.1 Task Wording 


It has generally been found that even untrained individuals can do very well, 
sometimes even better than trained people or computer algorithms, at deter- 
mining the likelihood of a person committing another crime after their initial 
arrest. 

We are interested in knowing whether this accuracy can be further improved 
by combining individual judgement with the advice of crowds, experts, or algo- 
rithms. In what follows, you will be given an actual arrest record for a person 
arrested in Broward County, Florida. We already know whether the person com- 
mitted another crime within the next two years. You will be asked to give us a 
probability of the person re-offending along the following lines. 

We have collected advice from several sources: 


— Several Mechanical Turk surveys of people like yourself. 

— A judge with over 10 years of experience. 

— A machine learning algorithms, developed by computer scientists and crim- 
inal justice experts, that use historic recidivism data to predict probability 
of re-offending. 


Warning: There are attention checks in this survey. We reserve 
the right to deny payment if a participant fails these checks, as that 
indicates the participant is not actually doing the tasks. 


A.2 Defendant Profiles 


The defendant is a male aged 22. They have been charged with: Possession of 
Cocaine. This crime is classified as a felony. They have been convicted of 0 prior 
crimes. They have 0 juvenile felony charges and 0 juvenile misdemeanor charges 
on their record. 

The defendant is a male aged 38. They have been charged with: Manufac- 
turing Cannabis/Marijuana. This crime is classified as a felony. They have been 
convicted of 3 prior crimes. They have 0 juvenile felony charges and 0 juvenile 
misdemeanor charges on their record. 

The defendant is a male aged 23. They have been charged with: Grand Theft. 
This crime is classified as a felony. They have been convicted of 3 prior crimes. 
They have 0 juvenile felony charges and 0 juvenile misdemeanor charges on their 
record. 
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'The defendant is a male aged 27. They have been charged with: Possession 
of Meth. This crime is classified as a felony. They have been convicted of 5 prior 
crimes. They have 0 juvenile felony charges and 0 juvenile misdemeanor charges 
on their record. 

'The defendant is a male aged 24. They have been charged with: Driving with 
a Revoked License. This crime is classified as a felony. They have been convicted 
of 2 prior crimes. They have 0 juvenile felony charges and 0 juvenile misdemeanor 
charges on their record. 

The defendant is a female aged 33. They have been charged with: Child 
Neglect. T'his crime is classified as a felony. They have been convicted of 1 prior 
crimes. They have 0 juvenile felony charges and 0 juvenile misdemeanor charges 
on their record. 

The defendant is a male aged 22. They have been charged with: Disorderly 
Conduct. This crime is classified as a misdemeanor. They have been convicted of 
0 prior crimes. They have 0 juvenile felony charges and 0 juvenile misdemeanor 
charges on their record. 

The defendant is a male aged 24. They have been charged with: Resisting 
an Officer with Violence. This crime is classified as a felony. They have been 
convicted of 0 prior crimes. They have 0 juvenile felony charges and 0 juvenile 
misdemeanor charges on their record. 


A.3 Examples of Treatment 


A group of 200 people recruited from Mechanical Turk, on average rated the 
defendant as 80% likely to commit another felony crime within the next two 
years. 

Previously, you forecast that the defendant was [RESPONDENT’S PREVI- 
OUS FORECAST] likely to commit another felony crime within the next two 
years. 

If you would like to update your forecast, you can do so now. If 
not, just enter the same numbers as you entered previously. 

A judge with more than 10 years of experience rated the defendant as 80% 
likely to commit another felony crime within the next two years. 

Previously, you forecast that the defendant was [RESPONDENT’S PREVI- 
OUS FORECAST] likely to commit another felony crime within the next two 
years. 

If you would like to update your forecast, you can do so now. If 
not, just enter the same numbers as you entered previously. 

An algorithm developed by computer scientists and criminal justice researchers, 
based on a statistical analysis of thousands of past defendant records, rated the 
defendant as 80% likely to commit another felony crime within the next two 
years. 

Previously, you forecast that the defendant was [RESPONDENT’S PREVI- 
OUS FORECAST] likely to commit another felony crime within the next two 
years. 
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If you would like to update your forecast, you can do so now. If 
not, just enter the same numbers as you entered previously. 


A.4 MTurk Study Specifics 


In addition to the specifics of the design included in the manuscript, below are 
some additional specific items related to fielding the study on MTurk: 


1. Approval Rate: HIT Approval Rate > 95% 

2. Location: United States 

3. Study Description: Respondents will be asked to evaluate a series of real 
criminal profiles and asked to predict the likelihood of recidivism with and 
without the help of advice. 

4. Keywords: survey, criminal justice, forecasting, predication 


A.5 Personality Inventories 


A.5.1 NC and NE 

Please indicate the extent to which these statements are characteristic or un- 
characteristic of you (On a scale from 1 to 5, with being extremely characteristic 
and 5 being extremely uncharacteristic). 


1. I have opinions about almost everything. 

2. I like having responsibility for handling situations that require a lot of think- 
ing. 

It is very important to me to hold strong opinions. 

4. I often prefer to remain neutral about complex issues. 


= 


A.5.2 TIPI for Big 5 

Here are a number of personality traits that may or may not apply to you. Please 
indicate the extent to which you agree or disagree that these characteristics apply 
to you. You should rate the extent to which the pair of traits applies to you, 
even if one characteristic applies more strongly than the other. (On a scale from 
1 to 7, with 1 = strongly disagree and 7 = strongly agree). 


Extroverted, enthusiastic 

Critical, quarrelsome 

Dependable, self-disciplined 
Anxious, easily upset 

Open to new experiences, complex 
Reserved, quiet 

Sympathetic, warm 

Disorganized, careless 

Calm, emotionally stable 
Conventional, uncreative 


SO DO uU TCI ade ed bra 


= 
> 
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A.6 Trust in Automation Index 


Many organizations now use algorithms to make forecasts. Some high profile 
examples include the use of statistics in baseball to choose players (Moneyball) 
or Nate Silvers use of statistics to predict elections. To what extent do you agree 
or disagree with the following statements about algorithms? (On a scale from 
1 to 7, with 1 = strongly agree and 7 = strongly disagree). Given the variance 
in valence, items were coded so that the highest end of the response range (7) 
indicates high trust in automation and the lowest end of the range (1) indicates 
low trust. 


1. 
2. 


sing algorithms increases the chances of organizations achieving their goals. 

sing algorithms increases the effectiveness of organizations in making good 

ecisions. 

sing algorithms improves the output quality for organizations. 

sing algorithms makes it more likely for organizations to make errors. 

5. Modern organizations rely too much on algorithms to make decisions about 

he future. 

6. Using algorithms is an effective way to overcome human biases. 

7. When I am uncertain about something, I will trust the information from an 
algorithm in place of my own judgement. 

8. When I am uncertain about something, I will tend to trust my own intuition 
and judgement over the information from an algorithm. 


oe 
acade 


gaje 


A.7 Base Relationships: Empirical Motivation 


As an empirical motivation for our full study, we offer a short discussion of our 
base findings of relative influence of the treatment conditions in the experiment. 
We show the impact of advice from an algorithm or a judge relative to the base- 
line category of average past MTurk respondents for our two behavioral measures 
of trust in Table |3} advice weight and distance to advice. The strong positive 
impacts from the first model (column 1) for each condition suggest respondents 
are reacting to the advice, with the magnitude of the effect in the algorithm con- 
dition nearly twice that of the judge condition. Second, the pronounced negative 
effects in the second model (column 2) demonstrate the impact of the algorithm 
and judge treatments on reducing the distance between respondents’ predictions 
and the advice-giver relative to the baseline category. Similarly, the effects are 
nearly doubled in the algorithm condition. 

'These results demonstrate two things. First, respondents were significantly 
more likely to change their evaluations based on the advice of “experts,” whether 
human or machine-derived than they were to trust the ^wisdom of the crowd." 
And second the algorithm condition is where the strongest effects are observed, 
suggesting respondents trust algorithms to a greater degree than advice from 
humans. This is an important finding by itself, and one we explore in greater 
detail in a separate paper. 
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'Table 3. Experimental Impacts on Dependent Variables of Interest 


Dependent variable: 


Advice Weight Distance to Advice 


(1) (2) 

Algorithm Condition 0.134*** (0.013) | —6.563*** (0.864) 
Judge Condition 0.073*** (0.013) | —3.812*** (0.857) 
Constant 0.156*** (0.009) — 27.353*** (0.608) 
Observations 3,274 3,274 

R? 0.031 0.017 
Adjusted R? 0.030 0.017 
Residual Std. Error (df — 3271) 0.305 20.113 

F Statistic (df — 2; 3271) 52.076*** 29.117*** 


*p«0.1; **p«0.05; ***p<0.01 
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The book entitled An Introduction to Nonparametric Statistics 
presents an exceptional introduction to nonparametric statistical tech- 
niques. To make it easier to read, I divided this book review into two sections: 
book description and comparisons with other books on nonparametric statistics. 


1 Book description 


According to the author, the book's target audience is graduate students in 
the applied statistical field. I find the book also suitable for graduate students 
researching applied mathematics and theoretical statistics. To make the best 
use of the book, the reader is expected to have a good knowledge of differential 
and multivariate integral calculus, matrix algebra, probability (mainly mean and 
variance of random variables), and classical inferential statistics. 

A striking feature of the book is the logical and chronological chain of ideas. 
I classify the book as logical because it bothers to review the parametric statis- 
tics part corresponding to the nonparametric counterpart that will be covered 
immediately afterward. This logical approach favors learning because many un- 
dergraduate students study the parametric statistics before the nonparametric 
counterparts, and students can promptly relate what they previously learned 
in parametric statistics to what they study in nonparametric statistics. Thus, 
the book teaches nonparametric statistics more naturally and logically using the 
connection with the parametric methods. The first part of each chapter, except 
for two chapters: Chapter 1 (this is a leveling chapter of the book) and Chapter 8 
(this is a chapter more focused on empirical probability density function graph- 
ing techniques), provides a brief review of the parametric technique equivalences 
to the nonparametric techniques that will be presented below. 
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I say that the book is chronological because it presents contents of non- 
parametric statistics in the same order as the articles that proposed such non- 
parametric techniques, favoring the understanding of the contents as the degree 
of complexity of the techniques increases gradually. For example, the author 
is constantly concerned with presenting the popular and traditional nonpara- 
metric tests: sign test, Mann-Whitney-Wilcoxon test, and Kruskal-Wallis test 
in chronological order, which are also included in other nonparametric statistics 
books. 

'The book has a brief introduction, 10 chapters and 2 appendices. The in- 
troduction leaves out something to be desired as to the fact that it does not 
list all the packages that will be used later in the text. It is good practice to 
list all packages used in the book, as it is helpful for the reader to prepare the 
computer environment in advance. It is also convenient to allow a university's 
computer lab with all the necessary packages already installed on the comput- 
ers. For accessibility, I list here all the packages needed to run the R 
code in the book: MultNonParam, NonparametricHeuristic, dev- 
tools, BSDA, exactRankTests, DescTools, CuM2SL2Test, clinfun, muStat, crank, 
deming, Hotelling, ICSNP, KernSmooth, quantreg, MASS, boot, bootstrap and 
VGAM 

Chapter 1 reviews the probability density functions and properties of random 
variables used throughout the book as well as the definitions and results of 
classical statistical inference (confidence intervals and hypothesis testing). Thus, 
this chapter is essential for the textbook to be self-sufficient, as far as possible, 
within the area of nonparametric statistics. 

'The content of nonparametric statistics starts from Chapter 2, which covers 
nonparametric inferential techniques for a single sample, such as the sign test. 

Chapter 3 presents nonparametric inferential techniques for the case of two 
samples, and the theory behind the other tests and rank tests, is presented. Also, 
in Chapter 3, the widely used nonparametric tests such as Mann-Whitney and 
Mann-Whitney-Wilcoxon tests are presented; Mood's median test is presented 
for didactic purposes. 

Chapter 4 extends the results presented to the case where we have three or 
more groups, explaining in detail the Kruskal-Wallis test. Nonparametric tech- 
niques analogous to the ANOVA (Analysis of Variance) of parametric inference 
are presented in Chapter 5. 

Chapter 6 comments on the limitations of interpreting the Pearson corre- 
lation coefficient and aims to present alternative correlation measures to the 
Pearson correlation coefficient, which are the correlation coefficients proposed 
by Spearman and Kendall. 

Chapter 7 presents techniques to perform multivariate nonparametric in- 
ference (context where observations contain more than two variables), such as 
population position parameter vector estimation and hypothesis testing about 
this same parameter vector. 


1 Note: some packages are already installed along with R or need to be installed from 
GitHub or CRAN. 
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Chapter 8 is the shortest in the book and presents techniques for estimating a 
sample's probability density function and plotting empirical probability density 
functions. The histogram graph is also presented in detail in this chapter. 

Alternative regression techniques such as Kernel regression, Local regression, 
Isotonic regression, Quantile regression, and “Resistant regression” are presented 
in Chapter 9. When commenting on "Resistant regression", the author missed 
the opportunity to present basic results about the classes of M-estimators and 
even to comment on the existence of other classes of estimators (L-estimators, 
S-estimators). Knowing the specific terminology would be useful if the reader 
were interested to learn more about robust estimators. The topic of Splines is 
also covered in this chapter. 

The final chapter introduces two resampling techniques: bootstrap and Jack- 
knife. Several types of bootstrap are presented, as well as the advantages and 
disadvantages of each type of bootstrap. 

Appendix A presents the codes in SAS equivalent to the codes presented in 
R throughout the course. Appendix B contains the codes in R used to generate 
the various tables and graphs presented throughout the text to support the 
author's argument. Thus, Appendix B makes the material presented in the book 
reproducible by any reader with basic knowledge of R. 

Two packages of R are used extensively throughout the book, to illustrate 
the results of different tests and nonparametric techniques or to do data analy- 
sis, and they are MultNonParam e NonparametricHeuristic. The MultNonParam 
package is available on CRAN, and the NonparametricHeuristic package is avail- 
able on GitHub. The book's author created both packages. As the package names 
suggest, NonparametricHeuristic is useful for teaching nonparametric statistics 
while MultNonParam is actually used in data analysis. 


2 Comparisons to Nonparametric Statistical Methods 
Using R 


An important component of the book is illustrating applications of the techniques 
to real data through free software R. This greatly enriches the text, as the text 
presents both theory and applications as well as provides the code in R and SAS 
so that the reader can carry out their own statistical analyses. Other books also 
follow this same approach of presenting code for nonparametric data analysis 
using R, such as Nonparametric Statistical Methods Using R 
2015). I now compre the book Introduction to Nonparametric Statistics (INS) 
to the book Nonparametric Statistical Methods Using R (NSMUR). Below, I list 
the point-by-point similarities and differences between them. 


2.1 Similarities 


1. Both books illustrate the theory through applications using R. 
2. Both books provide the code for the examples. 


Book Review: An Introduction to Nonparametric Statistics 127 


3. Both books are concise, INS has about 200 pages of content while NSMUR 
has about 250 pages of content. 

4. Both books cover resampling techniques such as bootstrapping. 

5. Both books cover the most widespread nonparametric techniques while hav- 
ing additional topics of its own. For instance, INS provides a brief intro- 
duction to the concepts of local regression, isotonic regression, and quantile 
regression. NSMUR has a chapter dedicated to techniques for survival anal- 
ysis such as Kaplan-Meier estimator and log rank test as well as a chapter 
dedicated to cluster correlated data analysis. 


2.2 Differences 


= 


. Only NSMUR includes a chapter reviewing basic R topics. 

. Only INS presents the codes in SAS. 

3. Only INS makes a chapter reviewing basic topics of probability and classical 
statistical inference. 

4. NSMUR has more exercises than INS, which is an advantage for the reader 
who likes to fix learning through exercise solving. 

5. Only INS features Jackknife. 


N 


The INS book integrates the examples and R codes with the text, making 
the reading very fluid. Compared to more traditional books like|Hettmansperger 
(1984) and |Hettmansperger and McKean| (2011), it is notable that INS is more 


focused on the applied part of nonparametric statistics and offers more com- 
ments on R code. In summary, the INS and NSMUR books fulfill the role of 
introductory books on nonparametric statistics very well. 

It would be interesting to see in a possible next edition of the INS an extension 
of the chapter dealing with alternative regression methods, approaching isotonic 
regression and quantile regression with a little more theoretical depth. 
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